Comparing the Performance of OAI-PMH with ResourceSync
@petrknoth @mart1nkle1n
OR 2019, 06/12/2019, Hamburg, Germany
Comparing the Performance of
OAI-PMH with ResourceSync
Petr Knoth, Matteo Cancellieri
Knowledge Media Institute
The Open University
UK
Martin Klein
Research Library
Los Alamos National Laboratory
USA
Comparing the Performance of OAI-PMH with ResourceSync
@petrknoth @mart1nkle1n
OR 2019, 06/12/2019, Hamburg, Germany
ResourceSync and repositories
“A single scientific repository is of limited value, real benefits
come from the ability to exchange data within a network …
… interoperability allows us to exploit today's computational
power so that we can aggregate, data mine, create new tools
and services, and generate new knowledge from repository
content.” - COAR
ResourceSync and repositories
Protocols for data exchange are the blood of the
scholarly communication system
Aggregators and ResourceSync
[Diagram labels: ResourceSync (CORE FastSync); 3rd parties (data analysis, TDM)]
Repository aggregators have large full text collections
core.ac.uk stats:
• 13,117,488 Hosted full texts
• 135,539,113 Metadata records
• ~78m Links to full text
• 15TB of raw plain text
• 4,123 Data providers
Many challenges with OAI-PMH implementations …
• Locating full text URLs in metadata
• Restrictions on full text downloading
• Sequential nature of OAI-PMH
• Failing resumption tokens
• Incremental updates
• Scalability
• Metadata interoperability
• Reliability
• No content harvesting support
Speed of OAI-PMH implementations (slide 7)
Aggregators have a lot of usage
• January 2019 – CORE reached over 10M monthly active users for the first time
• 571% increase from January 2018
• By usage, core.ac.uk is in the top 0.0009% of global websites
Aggregator’s challenge
• Stay up to date despite thousands of data providers
• Efficiently expose large amounts of data to many users:
   • Human users
   • Machines (scalability!)
• OAI-PMH implementations can hardly cope with the job:
   • Scalability
   • Metadata inconsistency
   • Support for metadata harvesting only
Research question
Is ResourceSync better suited for the job than OAI-PMH?
OAI-PMH - Background
http://openarchives.org/pmh/
• Recurrent metadata exchange from a Data Provider to Service Providers
• XML metadata only
• Repository centric
• Devised 1999-2002, prior to REST, prior to dominance of web search engines
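The repository-centric, page-by-page exchange can be sketched in a few lines. This is a minimal illustration, not the harvester used in the study: `harvest_identifiers` and the injected `fetch` callable are hypothetical names, and only the `ListRecords` verb, the `resumptionToken` element and the OAI-PMH 2.0 namespace come from the protocol itself.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(
    base_url: str,
    fetch: Callable[[str], str],  # hypothetical: takes a URL, returns the response body
    metadata_prefix: str = "oai_dc",
) -> Iterator[str]:
    """Yield record identifiers, following resumption tokens page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        for header in root.iter(f"{OAI_NS}header"):
            identifier = header.findtext(f"{OAI_NS}identifier")
            if identifier:
                yield identifier
        token = root.findtext(f".//{OAI_NS}resumptionToken")
        if not token or not token.strip():
            break  # an absent or empty token marks the last page
        # The next request can only be built from the previous response,
        # so pages cannot be fetched in parallel.
        params = {"verb": "ListRecords", "resumptionToken": token.strip()}
```

Because each request depends on the token returned by the previous one, a failing token stops the whole harvest at that page.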
ResourceSync - Background
http://www.openarchives.org/rs/1.1/resourcesync
• Synchronization of resources from a Source to Destinations
• Web resources, anything with an HTTP URI & representation
• Resource centric
• Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications, existing Search Engine Optimization practice
ResourceSync in a Nutshell
ResourceSync Capabilities
Many to One - Aggregator
ResourceSync is based on Sitemaps
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
  …
</urlset>
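Because the format is a plain Sitemap, a Destination can read it with any XML parser. The sketch below uses only the Python standard library; `changed_since` and `SAMPLE` are illustrative names that are not part of the talk, and the sample reuses the two entries from the slide.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import List

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc><lastmod>2013-01-02T13:00:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc><lastmod>2013-01-02T14:00:00Z</lastmod></url>
</urlset>"""


def changed_since(sitemap_xml: str, since: datetime) -> List[str]:
    """Return the URIs whose <lastmod> is later than `since`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None or lastmod is None:
            continue  # entries without a timestamp are skipped in this sketch
        modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if modified > since:
            changed.append(loc)
    return changed


cutoff = datetime(2013, 1, 2, 13, 30, tzinfo=timezone.utc)
print(changed_since(SAMPLE, cutoff))
```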
ResourceSync Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2019-06-11T09:00:00Z"
         completed="2019-06-11T09:00:44Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <lastmod>2019-06-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="823"
           type="text/xml" />
  </url>
</urlset>
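The `hash` and `length` attributes let a Destination check a downloaded resource against what the Source advertised. A minimal sketch, assuming the body has already been fetched by other means; `Entry`, `read_resource_list` and `matches` are hypothetical helper names, and only MD5 is handled here.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, NamedTuple, Optional

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


class Entry(NamedTuple):
    md5: Optional[str]
    length: Optional[int]


def read_resource_list(xml_text: str) -> Dict[str, Entry]:
    """Map each resource URI to the MD5 digest and length declared for it."""
    entries: Dict[str, Entry] = {}
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue
        md5, length = None, None
        md = url.find("rs:md", NS)
        if md is not None:
            # The hash attribute may carry several whitespace-separated digests.
            for part in md.get("hash", "").split():
                if part.startswith("md5:"):
                    md5 = part[len("md5:"):]
            if md.get("length") is not None:
                length = int(md.get("length"))
        entries[loc] = Entry(md5, length)
    return entries


def matches(entry: Entry, body: bytes) -> bool:
    """True if the downloaded bytes agree with every declared property."""
    if entry.length is not None and entry.length != len(body):
        return False
    if entry.md5 is not None and hashlib.md5(body).hexdigest() != entry.md5:
        return False
    return True
```

A resource whose digest or length does not match can be re-fetched instead of being stored in a corrupted state.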
Resource List with Link
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2019-06-11T09:00:00Z"
         completed="2019-06-11T09:00:44Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <lastmod>2019-06-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="823"
           type="text/xml" />
    <rs:ln href="http://example.com/res1_content.pdf"
           rel="describes"
           length="8876"
           type="application/pdf" />
  </url>
</urlset>
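With the `rs:ln` element in place, a harvester no longer has to guess where the full text lives. The following sketch pairs each metadata record with the documents it describes; `fulltext_links` and `SAMPLE` are illustrative names, and the sample is a trimmed copy of the slide's Resource List.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2019-06-11T09:00:00Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <rs:md length="823" type="text/xml" />
    <rs:ln href="http://example.com/res1_content.pdf"
           rel="describes" length="8876" type="application/pdf" />
  </url>
</urlset>"""


def fulltext_links(resource_list_xml: str) -> Dict[str, List[str]]:
    """Map each metadata URI to the URIs it is declared to describe."""
    result: Dict[str, List[str]] = {}
    for url in ET.fromstring(resource_list_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue
        result[loc] = [
            ln.get("href")
            for ln in url.findall("rs:ln", NS)
            if ln.get("rel") == "describes" and ln.get("href")
        ]
    return result


print(fulltext_links(SAMPLE))
```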
ResourceSync Characteristics
• Designed to allow synchronization of resources, not just metadata
• Explicit link between metadata and the described resource
• Not prescriptive about the metadata format
• Web-centric
• Push-based Change Notifications (WebSub)
Comparative Analysis
1. Assess the speed of OAI-PMH implementations across repositories
   See results on slide 7 (Speed of OAI-PMH implementations)
Comparative Analysis
1. Assess the speed of OAI-PMH implementations across repositories
2. Understand the recall in full-text harvesting
Recall of full-text harvesting – the power of the explicit full text link
Comparative Analysis
1. Assess the speed of OAI-PMH implementations across repositories
2. Understand the recall in full-text harvesting
3. Evaluate simulated metadata harvesting with ResourceSync implementations for:
   a) Standard Mode
      • Resources synced via Resource Lists, one resource at a time (per HTTP transaction)
   b) Resource Dump Mode
      • Resources packaged into a Resource Dump, transferred via one HTTP transaction
   c) Batch Mode
      • Resources packaged into partial and on-demand Resource Dumps, transferred via multiple HTTP transactions
4.
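The three modes differ mainly in how many HTTP transactions carry the payload. The toy model below makes that explicit. It is an assumption-laden sketch, not the simulation used in the study: it ignores the requests for the lists and manifests themselves, treats latency and bandwidth as constants, and `http_transactions`, `estimated_seconds` and `batch_size` are hypothetical names.

```python
import math


def http_transactions(n_resources: int, mode: str, batch_size: int = 1000) -> int:
    """Count payload-carrying HTTP transactions for one full synchronization."""
    if n_resources < 0 or batch_size < 1:
        raise ValueError("n_resources must be >= 0 and batch_size >= 1")
    if mode == "standard":  # one GET per resource named in the Resource List
        return n_resources
    if mode == "dump":  # everything packaged into a single Resource Dump
        return 1 if n_resources else 0
    if mode == "batch":  # partial dumps of at most batch_size resources each
        return math.ceil(n_resources / batch_size)
    raise ValueError(f"unknown mode: {mode}")


def estimated_seconds(
    n_resources: int,
    mode: str,
    latency_s: float,  # assumed fixed cost per HTTP transaction
    bytes_per_resource: int,  # assumed average resource size
    bandwidth_bytes_per_s: float,
    batch_size: int = 1000,
) -> float:
    """Per-transaction latency plus raw transfer time, under the assumptions above."""
    overhead = http_transactions(n_resources, mode, batch_size) * latency_s
    transfer = n_resources * bytes_per_resource / bandwidth_bytes_per_s
    return overhead + transfer
```

In this model the transfer term is identical across modes, so any difference comes from the per-transaction overhead, which is why Standard Mode degrades as the number of resources grows.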
Speed of simulated ResourceSync implementations
Why On-Demand Resource Dumps
• Many repositories have hundreds of OAI sets:
   • Cannot materialize dumps for all of them (too much data and processing requirements)
   • Cannot rely on Resource List (too slow)
• HATEOAS approach:
  https://blog.core.ac.uk/2018/03/17/increasing-the-speed-of-harvesting-with-on-demand-resource-dumps/
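The idea of packaging only what a client asks for can be sketched as follows. This is a hypothetical illustration, not the CORE implementation: `build_on_demand_dump`, the in-memory `store` and the `/resources/<n>` naming are invented for the example, and the manifest is a simplified take on the Resource Dump Manifest in the ResourceSync specification, which should be consulted for the required attributes.

```python
import hashlib
import io
import zipfile
from typing import Dict, Iterable
from xml.sax.saxutils import escape, quoteattr

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"


def build_on_demand_dump(store: Dict[str, bytes], wanted: Iterable[str]) -> bytes:
    """Package only the requested resources, plus a manifest, into one ZIP."""
    buffer = io.BytesIO()
    entries = []
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, uri in enumerate(wanted):
            body = store[uri]
            path = f"/resources/{index}"  # hypothetical naming scheme
            archive.writestr(path.lstrip("/"), body)
            digest = hashlib.md5(body).hexdigest()
            entries.append(
                f"  <url>\n"
                f"    <loc>{escape(uri)}</loc>\n"
                f"    <rs:md hash={quoteattr('md5:' + digest)} "
                f"length={quoteattr(str(len(body)))} path={quoteattr(path)} />\n"
                f"  </url>\n"
            )
        # Simplified manifest: maps each URI to its path inside the package.
        manifest = (
            f'<urlset xmlns="{SITEMAP_NS}" xmlns:rs="{RS_NS}">\n'
            '  <rs:md capability="resourcedump-manifest" />\n'
            + "".join(entries)
            + "</urlset>\n"
        )
        archive.writestr("manifest.xml", manifest)
    return buffer.getvalue()
```

Nothing is materialized ahead of time here: the package is built per request, so a repository with hundreds of sets only pays for the subsets that are actually harvested.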
Recommendations for data providers
• Adopt ResourceSync at a platform level (EPrints, DSpace, Fedora, etc.)
• Many considerations:
   • Support Change Lists? Dumps? Naming of Capability Lists? On-demand Dumps? How to link resources? WebSub?
• Guidelines needed!
• Resource List adoption only viable for small providers
• Support for on-demand Resource Dumps needed!
• ResourceSync client-server implementation available: https://github.com/resync/resync
• CORE happy to benchmark repository platforms
• LANL working on a validator
Take-aways
• OAI-PMH implementations vary substantially in the number of records downloaded per second
• With Resource Dumps, ResourceSync provides up to 10 times faster harvesting speeds
• On-demand Resource Dumps for optimization
   • Not yet part of the standard
• Thanks to resource linking, low recall is less of an issue!

