NISO Training
http://www.niso.org/workrooms/onixpl-encoding/

ResourceSync: A Web-Based
Resource Synchronization
Framework
December 3, 2013
Speakers:
Bernhard Haslhofer - Postdoc Research Associate
Department of Computer Science, University of Vienna
Simeon Warner - Information Science, Cornell University
ResourceSync:
A Web-Based
Resource Synchronization
Framework

#resourcesync

ResourceSync is funded by
The Sloan Foundation & JISC
ResourceSync Webinar
December 3 2013

This is a short version of the complete ResourceSync tutorial,
which is available at
http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial

ResourceSync Tutorial History
• OAI8, June 2013
• Open Repositories, July 2013
• JCDL, July 2013
• TPDL 2013, September 2013
• LITA Forum, November 2013
• SWIB, November 2013
• …

Presenters

Simeon Warner
Cornell University

Bernhard Haslhofer
University of Vienna

ResourceSync Tutorial Contributors

Martin Klein
Los Alamos National Laboratory
<martinklein0815@gmail.com>
@mart1nkle1n

Herbert Van de Sompel
Los Alamos National Laboratory
<hvdsomp@gmail.com>
@hvdsomp

Robert Sanderson
Los Alamos National Laboratory
<azaroth24@gmail.com>
@azaroth24

Simeon Warner
Cornell University
<simeon.warner@cornell.edu>
@zimeon

Michael L. Nelson
Old Dominion University
<mln@cs.odu.edu>
@phonedude_mln

Richard Jones
Cottage Labs
<richard@cottagelabs.com>
@cottagelabs

OAI
Herbert Van de Sompel
Martin Klein
Robert Sanderson
(Los Alamos National Laboratory)
Simeon Warner
(Cornell University)

NISO
Todd Carpenter
Nettie Lagace
University of Oxford
Graham Klyne

Bernhard Haslhofer
(University of Vienna)
Michael L. Nelson
(Old Dominion University)

Lyrasis
Peter Murray

Carl Lagoze
(University of Michigan)

ResourceSync Technical Group

LOCKSS
David Rosenthal

Ex Libris Inc.
Shlomo Sanders

JISC
Richard Jones
Paul Walk
Stuart Lewis

RedHat
Christian Sadilek

OCLC
Jeff Young

Library of Congress
Kevin Ford

Timeline, Status of Specification(s)
• August 2013
o Release of ResourceSync framework Core specification - Version 0.9.1
o Public draft of ResourceSync Archives specification released
• September 2013
o Core specification on its way to becoming an ANSI standard
• November 2013
o Internal draft of ResourceSync Notification specification
• January 2014
o Public draft of ResourceSync Notification specification
• Mid 2014
o Core specification becomes ANSI/NISO standard

Pointers
• Specification
http://www.openarchives.org/rs/
http://www.openarchives.org/rs/resourcesync
http://www.openarchives.org/rs/archives

• List for public comment
https://groups.google.com/d/forum/resourcesync
• Client and simulator code
http://github.com/resync/resync
http://github.com/resync/simulator

ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach

2. Motivation & Use Cases
3. Framework Walkthrough

4. Framework (Technical) Details
5. Implementation
6. Q&A

ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual
Approach

Synchronize What?

• Web resources
o things with a URI that can be dereferenced
• Focus on needs of research communication and cultural heritage
organizations
o but aim for generality

Synchronize What?
• Small websites/repositories (a few resources) to large
repositories/datasets/linked data collections (many millions of
resources)

Synchronize What?
• Low change frequency (weeks/months) to high change
frequency (seconds)
• Synchronization latency and accuracy needs may vary

Why?
… because lots of projects and services are doing synchronization
but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this

• Experience with OAI-PMH: widely used in repos but
o XML metadata only
o Attempts at synchronizing actual content via OAI-PMH
(complex object formats, dc:identifier) not successful.
o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?

ResourceSync Problem
• Consideration:
• Source (server) A has resources that change over time: they
get created, modified, deleted
• Destination (servers) X, Y, and Z leverage (some)
resources of Source A.
• Problem:
• Destinations want to keep in step with the resource changes
at Source A: resource synchronization.
• Goal:
• Design an approach for resource synchronization aligned
with the Web Architecture that has a fair chance of adoption
by different communities.
• The approach must scale better than recurrent HTTP
HEAD/GET on resources.

Source: Core Synchronization Capabilities

PULL capabilities:

1. Describing content – publish a list of resources available for
synchronization to enable Destinations to perform an initial load
or catch-up with a Source
2. Packaging content – bundle resources to enable bulk download
by destinations
3. Describing changes – publish a list of resource changes to
enable destinations to stay synchronized and decrease latency
4. Packaging changes – bundle resource changes for bulk
download by destinations

Source: Notification Capabilities
To reduce synchronization latency and to optimize the synchronization
process, the Source can support the following PUSH capabilities:

1. Change Notification
• Notifies about changes to particular resources
• e.g., resource A has been updated | created | deleted
2. Framework Notification
• Notifies about changes to capabilities i.e., their documents
• e.g., a Change List has been updated | created | deleted

Source: Synchronization Features
1. Discovery of capabilities – support Destinations in discovering
all offered capabilities
o Applies to PULL, PUSH capabilities
2. Linking to related resources – provide links from resources
subject to synchronization to related resources
o Applies to PULL, PUSH capabilities

Destination: Synchronization Needs
1. Baseline synchronization – A destination must be able to
perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some
way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow catch-up after destination has been offline
3. Audit – A destination should be able to determine whether it is
synchronized with a source
- regarding coverage and accuracy
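The audit need above can be illustrated with a small sketch. It assumes that both the Source's published state and the Destination's local copy have already been reduced to a mapping from resource URI to content digest; the function name `audit` and the example digests are hypothetical and not part of ResourceSync.

```python
def audit(source_state, local_state):
    """Compare a Source's published state with a Destination's local copy.

    Both arguments map resource URI -> content digest (for example an md5 string).
    The report covers coverage (missing/extra) and accuracy (stale).
    """
    source_uris = set(source_state)
    local_uris = set(local_state)
    return {
        # at the Source but not yet at the Destination
        "missing": sorted(source_uris - local_uris),
        # present on both sides but with different content
        "stale": sorted(
            uri for uri in source_uris & local_uris
            if source_state[uri] != local_state[uri]
        ),
        # at the Destination but no longer at the Source
        "extra": sorted(local_uris - source_uris),
    }


# Hypothetical example data
source = {
    "http://example.com/res1": "md5:aaa",
    "http://example.com/res2": "md5:bbb",
    "http://example.com/res3": "md5:ccc",
}
local = {
    "http://example.com/res1": "md5:aaa",
    "http://example.com/res2": "md5:old",
    "http://example.com/res4": "md5:ddd",
}

report = audit(source, local)
in_sync = not any(report.values())
print(report, in_sync)
```

A Destination would be considered synchronized when all three lists in the report are empty.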

ResourceSync - Agenda

2. Motivation & Use Cases

Use Case 1: arXiv Mirroring and Data Sharing
• Repository of scholarly articles in
physics, mathematics, computer
science, etc.
• > 850k articles
• approx. 1.5 revisions per article on
average
• approx. 75k new articles per year
• Each article has full-text and separate
metadata record
• approx. 3.8M resources

Use Case 1: arXiv Mirroring and Data Sharing
• 2,700 updates daily
o at 8pm EST
o Currently using homebrew mirroring
solution (running with minor
modifications since 1994!)
o occasional rsync (file system-specific, auth issues)

Use Case 1: arXiv Mirroring / Data Sharing
• GOAL: Keep mirror sites synchronized with daily changes
• WANT:
o high consistency
o moderate latency
o robustness to global network outages (low admin effort)
o ability to verify sync status in case of questions
o low admin effort (i.e. standard approach, standard tools)
o reasonable consistency, latency, efficiency

Use Case 2: DBpedia Live Duplication
• Average of 2 updates per second
• Low latency desirable => need for a push technology

ResourceSync - Agenda

3. Framework Walkthrough

Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations
to know about, it may describe them:
o Publish a Resource List, a list of resource URIs and possibly
associated metadata
- Destination GETs the Resource List
- Destination GETs listed resources by their URI
o A Resource List describes the state of a set of resources at
one point in time (snapshot)
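As a rough illustration of the Source side, the sketch below serializes a snapshot into a Resource List using only the Python standard library. The function name `build_resource_list` and the input data are assumptions made for this example; the namespaces and the `capability` and `at` attributes follow the XML examples shown later in this deck.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Serialize with the sitemap namespace as default and "rs" for ResourceSync terms
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_resource_list(resources, at):
    """Return Resource List XML for (uri, lastmod) pairs captured at time `at`."""
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    ET.SubElement(urlset, "{%s}md" % RS_NS,
                  {"capability": "resourcelist", "at": at})
    for uri, lastmod in resources:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = uri
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical snapshot of two resources
snapshot = [
    ("http://example.com/res1", "2013-01-02T13:00:00Z"),
    ("http://example.com/res2", "2013-01-02T14:00:00Z"),
]
xml_text = build_resource_list(snapshot, at="2013-01-03T09:00:00Z")
print(xml_text)
```

A real Source would probably also add per-resource metadata such as hash, length and type, which this sketch leaves out.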

Source Capability 2: Packaging Content
By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:
o Publish a Resource Dump, a document that points to
packages of resource representations and necessary
metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
o A Resource Dump and the packages it points to reflect the
state of a set of resources at one point in time (snapshot)

Source Capability 3: Describing Changes
In order to achieve lower latency and/or greater efficiency, a source
may communicate about changes to its resources:
o Publish a Change List, a list of recent change events
(created, updated, deleted resource)
- Destination acts upon change events, e.g. GETs
created/updated resources, removes deleted resources.
o A Change List pertains to resources that changed in a
temporal interval with a start- and an end-date
- If a resource changed more than once, it will be listed
more than once
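One simple way a Source might derive change events is to compare two snapshots of its resources. The sketch below is only one possible approach, under the assumption that each snapshot maps a resource URI to a content digest; `diff_snapshots` is a hypothetical helper, not part of any ResourceSync library.

```python
def diff_snapshots(old, new):
    """Derive change events from two snapshots mapping URI -> content digest.

    Returns a list of (uri, change) tuples, where change is one of
    "created", "updated" or "deleted".
    """
    events = []
    for uri in sorted(new):
        if uri not in old:
            events.append((uri, "created"))
        elif old[uri] != new[uri]:
            events.append((uri, "updated"))
    for uri in sorted(old):
        if uri not in new:
            events.append((uri, "deleted"))
    return events


# Hypothetical snapshots taken at the start and end of an interval
old_snapshot = {"http://example.com/res1": "md5:a", "http://example.com/res2": "md5:b"}
new_snapshot = {"http://example.com/res1": "md5:a2", "http://example.com/res3": "md5:c"}

events = diff_snapshots(old_snapshot, new_snapshot)
print(events)
```

Note the limitation of this approach: a snapshot comparison yields at most one event per resource, so it cannot report that a resource changed several times within the interval. A Source that wants to list every change would need to record events as they happen.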

Source Capability 4: Packaging Changes
In order to reduce the number of requests to obtain resource
changes, a source may provide packaged bitstreams for changed
resources:
o Publish a Change Dump, a document that points to
packages containing bitstreams of recently changed
resources and necessary metadata
- Destination GETs the package
- Destination unpacks the package
- ZIP format supported
o A Change Dump and its packages pertain to resources that
changed in a temporal interval with a start- and an end-date
- If a resource changed more than once, it will be included
more than once

Destination: Key Processes

ResourceSync - Agenda

4. Framework (Technical) Details

So Many Choices
[Figure: existing approaches, labeled Push, Pull and Crawl: DSNotify, OAI-PMH, rsync, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, Sitemap, SPARQLpush, SDShare, AtomPub, RSS, PubSubHubbub]


Sitemap Format

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
  …
</urlset>

ResourceSync Sitemap Extensions
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln …/>
  <rs:md …/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:ln …/>
    <rs:md …/>
  </url>
  <url>
    …
  </url>
</urlset>

Related Resource Metadata Summary
• Attributes of the <rs:ln> element; c.f. resource metadata + pri
Element/Attribute | Description                                       | Defined by
<rs:ln>           |                                                   | ResourceSync
encoding          | HTTP Content-Encoding header value                | RFC2616
hash              | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
href              | Related resource URI (identity)                   | RFC4287
length            | HTTP Content-Length header value                  | RFC4287
modified          | Timestamp of last change (c.f. <lastmod>)         | Atom Link Ext.
path              | Path in ZIP package (Dump Manifests only)         | ResourceSync
pri               | Priority of link                                  | RFC6249
rel               | Relation - IANA registered or URI                 | RFC4287
type              | HTTP Content-Type header value                    | RFC4287

Resource Metadata Summary
Element/Attribute | Description                                           | Defined by
<loc>             | Resource URI (identity)                               | sitemaps
<lastmod>         | Timestamp of last change                              | sitemaps
<changefreq>      | Expected update frequency                             | sitemaps
<rs:md>           |                                                       | ResourceSync
change            | Change type (Change List & Change Dump Manifest only) | ResourceSync
encoding          | HTTP Content-Encoding header value                    | RFC2616
hash              | One or more content digests (md5, sha-1, sha-256)     | Atom Link Ext.
length            | HTTP Content-Length header value                      | RFC4287
path              | Path in ZIP package (Dump Manifests only)             | ResourceSync
type              | HTTP Content-Type header value                        | RFC4287

Link Relation Summary
Relation                 | Use in ResourceSync                     | Defined in
rel="alternate"          | Link from generic to specific URI       | HTML 5
rel="canonical"          | Link from specific to generic URI       | RFC6596
rel="collection"         | Resource is member of collection        | RFC6573
rel="contents"           | Link from dump to manifest              | HTML4
rel="describedby"        | Has metadata                            | Protocol for Web Description Resources (POWDER): Description Resources
rel="describes"          | Is metadata for                         | The 'describes' Link Relation Type
rel="duplicate"          | Mirror or alternative copy              | RFC6249
rel=".../rs/terms/patch" | A patch -- efficient change information | This specification
rel="memento"            | Link to time-specific URI               | Memento Internet Draft
rel="timegate"           | Link to timegate                        | Memento Internet Draft
rel="via"                | Provenance chain, came from             | RFC4287

Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
  <url>
    …
  </url>
</urlset>

Resource List
• Describe Source’s resources that are subject to synchronization
• At one point in time (snapshot)
• Creation can take some time – duration can be conveyed
• Typical Destination use: Baseline Synchronization, Audit

• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
• [@at, @completed] – interval of uncertainty
• Destination issues GETs against URIs to obtain resources
• Very similar to current Sitemaps
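On the Destination side, reading a Resource List could look roughly like the sketch below. It uses only the Python standard library; `parse_resource_list` is a hypothetical helper name, and the embedded document is the example from the previous slide with the elided second entry left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

RESOURCE_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
</urlset>
"""


def parse_resource_list(xml_text):
    """Return (document attributes, list of per-resource dicts)."""
    root = ET.fromstring(xml_text)
    doc_md = root.find("rs:md", NS)
    header = dict(doc_md.attrib) if doc_md is not None else {}
    resources = []
    for url in root.findall("sm:url", NS):
        entry = {
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        }
        res_md = url.find("rs:md", NS)
        if res_md is not None:
            entry.update(res_md.attrib)  # hash, length, type, ...
        resources.append(entry)
    return header, resources


header, resources = parse_resource_list(RESOURCE_LIST)
# The snapshot was taken somewhere in the interval [at, completed]
print(header["at"], header["completed"])
for res in resources:
    print(res["uri"], res.get("hash"), res.get("length"))
```

After parsing, the Destination would issue a GET for each listed URI and could use the hash and length values to check what it received.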

Resource Dump
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump"
         at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/resourcedump_part1.zip</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md length="97553"
           type="application/zip"/>
    <rs:ln rel="contents"
           href="http://example.com/resourcedump_manifest-part1.xml"
           type="application/xml"/>
  </url>
  <url>
    <loc>http://example.com/resourcedump_part2.zip</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
</urlset>

Resource Dump Manifest
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest"
         at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="text/html"
           path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="application/pdf"
           path="/resources/res2"/>
  </url>
</urlset>
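The manifest's path attributes let a Destination map the files inside a ZIP package back to resource URIs. The following is a minimal sketch under two assumptions that are not stated on the slides: the manifest is stored in the package under the name `manifest.xml`, and the leading slash of each path is dropped to obtain the ZIP entry name. The package is built in memory so the example is self-contained; `make_example_package` and `unpack` are hypothetical names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

MANIFEST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md type="text/html" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md type="application/pdf" path="/resources/res2"/>
  </url>
</urlset>
"""


def make_example_package():
    """Build a small ZIP package in memory (stand-in for a downloaded dump)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("manifest.xml", MANIFEST)  # assumed manifest file name
        zf.writestr("resources/res1", b"<html>res1</html>")
        zf.writestr("resources/res2", b"%PDF-1.4 res2")
    return buf.getvalue()


def unpack(package_bytes):
    """Return a dict mapping resource URI -> bitstream, driven by the manifest."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        manifest = ET.fromstring(zf.read("manifest.xml"))
        for url in manifest.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            path = url.find("rs:md", NS).get("path")
            # assumption: ZIP entry names carry no leading slash
            contents[uri] = zf.read(path.lstrip("/"))
    return contents


contents = unpack(make_example_package())
for uri, data in contents.items():
    print(uri, len(data))
```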

Resource Dump
• A Resource Dump points to packages (ZIP files) that contain
representations of the Source’s resources
• At one point in time (snapshot)
• Resource Dump is mandatory, even if there is only one ZIP file
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Baseline Synchronization, bulk
download

• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
• [@at, @completed] – interval of uncertainty
• GETs against individual URIs from the Resource List achieve the
same result (ignoring varying freshness)

Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"
           hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
  <url>
    …
  </url>
</urlset>

Change List
• A Change List pertains to a Source’s resources that changed
• Changes that occurred during a temporal interval with start- and end-date
• Typical Destination use: Incremental Synchronization, Audit
• Changes are listed in chronological order
• Multiple changes to one resource result in the resource being
listed multiple times, once per change
• Source determines duration of temporal interval
• Destinations use @from and @until to determine freshness
• Destinations issue GETs against URIs to obtain changed
resources
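The incremental synchronization described above can be sketched as a loop over the Change List entries in document order. This is an illustration only: `apply_change_list` is a hypothetical helper, the local copy is modeled as a dict, and the `fetch` argument stands in for an HTTP GET so that the example runs without network access.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Change List: res1 is created and later updated, res2 is deleted
CHANGE_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T11:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
"""


def apply_change_list(xml_text, local, fetch):
    """Apply change events, in listed (chronological) order, to `local`.

    `local` maps URI -> content; `fetch` is a callable that returns the
    current content for a URI.
    """
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        change = url.find("rs:md", NS).get("change")
        if change in ("created", "updated"):
            local[uri] = fetch(uri)
        elif change == "deleted":
            local.pop(uri, None)
    return local


fetched = []


def fake_fetch(uri):
    """Stand-in for an HTTP GET; records which URIs were requested."""
    fetched.append(uri)
    return ("content of " + uri).encode()


local = {"http://example.com/res2": b"old copy"}
apply_change_list(CHANGE_LIST, local, fake_fetch)
print(sorted(local), fetched)
```

Because res1 is listed twice, this naive loop fetches it twice. A Destination could instead keep only the last event per URI before issuing any GETs.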

Discovery of Capabilities
Requirements:
• Need to discover capabilities, i.e. Resource List, Resource
Dump, Change List, Change Dump, Archives, Notification
channels
• Need to know the type of capability each document
represents.
Approach:
• The Source publishes a Capability List that enumerates the
capabilities it supports.
• By pointing at Resource List, Change List, Resource Dump,
etc. using appropriate relation types, e.g. “resourcelist”,
“changelist”, “resourcedump” etc.
http://www.openarchives.org/rs/resourcesync#CapabilityList

Capability List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>
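A Destination could turn a Capability List like the one above into a lookup table from capability name to document URI. The sketch below uses only the Python standard library; `parse_capability_list` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CAPABILITY_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>
"""


def parse_capability_list(xml_text):
    """Return a dict mapping capability name -> URI of the capability document."""
    root = ET.fromstring(xml_text)
    capabilities = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is not None and md.get("capability"):
            capabilities[md.get("capability")] = url.findtext("sm:loc", namespaces=NS)
    return capabilities


capabilities = parse_capability_list(CAPABILITY_LIST)
print(capabilities.get("changelist"))
```

With this table, a Destination can decide whether to start from the Resource List, fetch a Resource Dump, or poll the Change List, depending on what the Source offers.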

Discovery of Capability Lists
Requirements:
• Need to discover a Capability List
Approaches:
• Introduce a link in the HTTP Link header of a resource that is
subject to synchronization, pointing at the Capability List with the
relation type “resourcesync”
• Introduce a link from an HTML document that is subject to
synchronization (<head> section), pointing at the Capability List
with the relation type “resourcesync”
• Link from a Resource List, etc. to the Capability List with the
relation type “up”
Link header on example.com/res1.pdf:
Link: <example.com/dataset1/capabilitylist.xml>; rel="resourcesync"
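To illustrate the Link header approach, the sketch below extracts the target of the first link whose relation type is "resourcesync". It is a simplified parser written for this example and does not cover every corner of the Link header syntax; `find_capability_list` is a hypothetical helper, and the example header uses absolute URIs as an assumption.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters
LINK_RE = re.compile(
    r'<([^>]*)>\s*((?:;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+)\s*)*)'
)
# One parameter: ;name="quoted value" or ;name=token
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def find_capability_list(link_header):
    """Return the target URI of the first rel="resourcesync" link, or None."""
    for target, params in LINK_RE.findall(link_header):
        for name, quoted, bare in PARAM_RE.findall(params):
            # rel may hold several space-separated relation types
            if name.lower() == "rel" and "resourcesync" in (quoted or bare).split():
                return target
    return None


header = (
    '<http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync", '
    '<http://example.com/res1>; rel="canonical"'
)
capability_list_uri = find_capability_list(header)
print(capability_list_uri)
```

A Destination would typically obtain the header value from the response to a HEAD or GET request on a resource and then fetch the Capability List it points to.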
Discovery via robots.txt
• Resource Lists are (enhanced) Sitemaps
• Sitemaps can be discovered via robots.txt
• Ergo, Resource Lists should be discoverable via robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
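Discovery via robots.txt amounts to collecting the Sitemap lines of the file. A minimal sketch follows, assuming the robots.txt content has already been downloaded as text; `sitemap_urls` is a hypothetical helper name.

```python
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
"""


def sitemap_urls(robots_txt):
    """Return the URLs given on Sitemap lines of a robots.txt document."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so the URL's own colons stay intact
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


urls = sitemap_urls(ROBOTS_TXT)
print(urls)
```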

Framework Structure

Motivation for Notifications
• Reduce synchronization latency by having the Source push out
resource change information
o To avoid continuous pull of Change Lists by Destinations
• Share information about changes to the Source’s
ResourceSync implementation, e.g. announcement of new
Resource List, new Capability List, etc.
o To avoid continuous polling of e.g. Resource Lists,
ResourceSync Description

Source: Notification Capabilities
PUSH capabilities:

1. Change Notification
• Notifies about changes to particular resources
• e.g., resource A has been updated | created | deleted
2. Framework Notification
• Notifies about changes to capabilities i.e., their documents
• e.g., a Change List has been updated | created | deleted
• Also for Capability Lists and Source Description

Notification Channels
• Notifications are sent via channels
o Change Notification: one channel per set of resources
o Framework Notification: one channel per set of resources
o Sent on level of capability document, not on index-level
o Notifications about changes to Source Description sent on all
Framework Notification channels
• Payload for notifications: <urlset> documents
• Transport protocol for notifications under discussion:
o PubSubHubbub - https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html - current choice
o WebSockets - http://tools.ietf.org/html/rfc6455 - may be added later

Change Notification Payload
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T09:07:00Z</lastmod>
    <rs:md change="updated"
           hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
  <url>
    …
  </url>
</urlset>

ResourceSync - Agenda

5. Implementation

DSpace support for metadata harvesting use case

DSpace Module:
https://github.com/CottageLabs/DSpaceResourceSync
PHP client:
https://github.com/stuartlewis/resync-php

Example resource URI:
http://mydspace.edu/dspace-rs/resource/123456789/7/qdc
o dspace-rs: ResourceSync webapp
o 123456789/7: Item handle
o qdc: Metadata Format

ResourceSync @ arXiv
• Use ResourceSync for both
mirroring and public data access
o efficient updates
o ability to do periodic audits
o public synchronization capability
o reduce admin burden
• Start with metadata + source for
mirroring use case (doing
experiments now)
• Open Access use cases require
processed PDF also

Getting a copy of arXiv
It might be as easy as:

(of course, you probably have to wait a while but it is nice to know ResourceSync is
stateless so one can efficiently restart)

Python Library and Client
• Aim to provide library code implementing all ResourceSync
facilities for use in both source and destination implementations
o Designed for Python 2.6 (RHEL6) and 2.7
• Client (resync) supports many destination operations, inspired
by the common Unix rsync program
• Client also supports some operations that might be useful in a
source, such as generation of static Resource Lists, or periodic
Change Lists (used in arXiv experiments)
• Explorer (resync-explorer) intended to allow easy inspection
of a source’s resource sets and capabilities
• Developed since ResourceSync v0.5, updated for v0.9.1
http://github.com/resync/resync
On PyPI: “easy_install resync”

ResourceSync Source Simulator
• Python code using Tornado server
• Provides random set of resources of different sizes updated at a
particular rate
• Very useful for testing Destination code

http://github.com/resync/simulator

ResourceSync - Agenda

6. Q&A

THANK YOU
We look forward to seeing you at a
future NISO training event.

Architecture types and enterprise applications.pdf
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
Univ-Connecticut-ChatGPT-Presentaion.pdf
2021 HotChips TSMC Packaging Technologies for Chiplets and 3D_0819 publish_pu...
Chapter 5: Probability Theory and Statistics
A comparative study of natural language inference in Swahili using monolingua...
STKI Israel Market Study 2025 version august
Programs and apps: productivity, graphics, security and other tools
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
Hybrid model detection and classification of lung cancer
OMC Textile Division Presentation 2021.pptx
project resource management chapter-09.pdf

NISO ResourceSync Training Session

  • 1. NISO Training: ResourceSync: A Web-Based Resource Synchronization Framework. December 3, 2013. http://www.niso.org/workrooms/onixpl-encoding/ Speakers: Bernhard Haslhofer - Postdoc Research Associate, Department of Computer Science, University of Vienna; Simeon Warner - Information Science, Cornell University
  • 2. ResourceSync: A Web-Based Resource Synchronization Framework #resourcesync ResourceSync is funded by The Sloan Foundation & JISC ResourceSync Webinar December 3 2013 2
  • 3. This is a short version of the complete ResourceSync tutorial, which is available at http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
  • 4. ResourceSync Tutorial History
      • OAI8, June 2013; Open Repositories, July 2013; JCDL, July 2013; TPDL 2013, September 2013; LITA Forum, November 2013; SWIB, November 2013; …
      • Presenters: Simeon Warner (Cornell University), Bernhard Haslhofer (University of Vienna)
  • 5. ResourceSync Tutorial Contributors
      • Martin Klein, Los Alamos National Laboratory, <martinklein0815@gmail.com>, @mart1nkle1n
      • Herbert Van de Sompel, Los Alamos National Laboratory, <hvdsomp@gmail.com>, @hvdsomp
      • Robert Sanderson, Los Alamos National Laboratory, <azaroth24@gmail.com>, @azaroth24
      • Simeon Warner, Cornell University, <simeon.warner@cornell.edu>, @zimeon
      • Michael L. Nelson, Old Dominion University, <mln@cs.odu.edu>, @phonedude_mln
      • Richard Jones, Cottage Labs, <richard@cottagelabs.com>, @cottagelabs
  • 6. ResourceSync Technical Group
      • OAI: Herbert Van de Sompel, Martin Klein, Robert Sanderson (Los Alamos National Laboratory); Simeon Warner (Cornell University); Bernhard Haslhofer (University of Vienna); Michael L. Nelson (Old Dominion University); Carl Lagoze (University of Michigan)
      • NISO: Todd Carpenter, Nettie Lagace
      • University of Oxford: Graham Klyne
      • Lyrasis: Peter Murray
  • 7. ResourceSync Technical Group
      • LOCKSS: David Rosenthal
      • Ex Libris Inc.: Shlomo Sanders
      • JISC: Richard Jones, Paul Walk, Stuart Lewis
      • RedHat: Christian Sadilek
      • OCLC: Jeff Young
      • Library of Congress: Kevin Ford
  • 8. Timeline, Status of Specification(s)
      • August 2013
          o Release of ResourceSync framework Core specification - Version 0.9.1
          o Public draft of ResourceSync Archives specification released
      • September 2013
          o Core specification on its way to become an ANSI standard
      • November 2013
          o Internal draft of ResourceSync Notification specification
      • January 2014
          o Public draft of ResourceSync Notification specification
      • Mid 2014
          o Core specification becomes ANSI/NISO standard
  • 9. Pointers
      • Specification
          o http://www.openarchives.org/rs/
          o http://www.openarchives.org/rs/resourcesync
          o http://www.openarchives.org/rs/archives
      • List for public comment
          o https://groups.google.com/d/forum/resourcesync
      • Client and simulator code
          o http://github.com/resync/resync
          o http://github.com/resync/simulator
  • 10. ResourceSync - Agenda
      1. ResourceSync: Problem Perspective & Conceptual Approach
      2. Motivation & Use Cases
      3. Framework Walkthrough
      4. Framework (Technical) Details
      5. Implementation
      6. Q&A
  • 11. ResourceSync - Agenda: 1. ResourceSync: Problem Perspective & Conceptual Approach
  • 12. Synchronize What?
      • Web resources
          o things with a URI that can be dereferenced
      • Focus on needs of research communication and cultural heritage organizations
          o but aim for generality
  • 13. Synchronize What?
      • Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
  • 14. Synchronize What?
      • Low change frequency (weeks/months) to high change frequency (seconds)
      • Synchronization latency and accuracy needs may vary
  • 15. Why? … because lots of projects and services are doing synchronization but have to resort to ad-hoc, case-by-case approaches!
      • Project team involved with projects that need this
      • Experience with OAI-PMH: widely used in repos but
          o XML metadata only
          o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful
          o Web technology has moved on since 1999
      • Devise a shared solution for data, metadata, linked data?
  • 16. ResourceSync Problem
      • Consideration:
          o Source (server) A has resources that change over time: they get created, modified, deleted
          o Destination (servers) X, Y, and Z leverage (some) resources of Source A
      • Problem:
          o Destinations want to keep in step with the resource changes at Source A: resource synchronization
      • Goal:
          o Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities
          o The approach must scale better than recurrent HTTP HEAD/GET on resources
  • 17. Source: Core Synchronization Capabilities (PULL)
      1. Describing content: publish a list of resources available for synchronization to enable Destinations to perform an initial load or catch-up with a Source
      2. Packaging content: bundle resources to enable bulk download by Destinations
      3. Describing changes: publish a list of resource changes to enable Destinations to stay synchronized and decrease latency
      4. Packaging changes: bundle resource changes for bulk download by Destinations
  • 18. Source: Notification Capabilities (PUSH). To reduce synchronization latency and to optimize the synchronization process the Source can support:
      1. Change Notification
          o Notifies about changes to particular resources
          o e.g., resource A has been updated | created | deleted
      2. Framework Notification
          o Notifies about changes to capabilities, i.e., their documents
          o e.g., a Change List has been updated | created | deleted
  • 19. Source: Synchronization Features
      1. Discovery of capabilities: support Destinations in discovering all offered capabilities
          o Applies to PULL, PUSH capabilities
      2. Linking to related resources: provide links from resources subject to synchronization to related resources
          o Applies to PULL, PUSH capabilities
  • 20. Destination: Synchronization Needs
      1. Baseline synchronization: a Destination must be able to perform an initial load or catch-up with a Source
          - avoid out-of-band setup
      2. Incremental synchronization: a Destination must have some way to keep up-to-date with changes at a Source
          - subject to some latency; minimal: create/update/delete
          - allow to catch up after the Destination has been offline
      3. Audit: a Destination should be able to determine whether it is synchronized with a Source
          - regarding coverage and accuracy
  • 21. ResourceSync - Agenda: 2. Motivation & Use Cases
  • 22. Use Case 1: arXiv Mirroring and Data Sharing
      • Repository of scholarly articles in physics, mathematics, computer science, etc.
      • > 850k articles
      • approx. 1.5 revisions per article on average
      • approx. 75k new articles per year
      • Each article has full-text and separate metadata record
      • approx. 3.8M resources
  • 23. Use Case 1: arXiv Mirroring and Data Sharing
      • 2,700 updates daily
          o at 8pm EST
          o Currently using homebrew mirroring solution (running with minor modifications since 1994!)
          o occasional rsync (file system-specific, auth issues)
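A rough sense of why per-resource polling scales poorly for this use case can be had from the arXiv figures above. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: one HEAD request per resource per polling pass, one GET per changed resource, and a single Change List fetched per day. The function names are illustrative and not part of any ResourceSync tool.

```python
# Figures taken from the slides; everything else is an assumption.
RESOURCES = 3_800_000      # approx. number of arXiv resources
UPDATES_PER_DAY = 2_700    # daily updates


def polling_requests(resources, updates, passes_per_day=1):
    """HEAD every resource once per pass, then GET each changed resource."""
    return resources * passes_per_day + updates


def change_list_requests(updates, change_lists_per_day=1):
    """GET each Change List, then GET each changed resource."""
    return change_lists_per_day + updates


if __name__ == "__main__":
    polled = polling_requests(RESOURCES, UPDATES_PER_DAY)
    listed = change_list_requests(UPDATES_PER_DAY)
    print("polling:", polled, "requests/day")
    print("change list:", listed, "requests/day")
```

Under these assumptions the polling approach issues requests in proportion to the size of the collection, while the Change List approach issues requests in proportion to the number of changes.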
  • 24. Use Case 1: arXiv Mirroring / Data Sharing
      • GOAL: Keep mirror sites synchronized with daily changes
      • WANT:
          o high consistency
          o moderate latency
          o robustness to global network outages (low admin effort)
          o ability to verify sync status in case of questions
          o low admin effort (i.e. standard approach, standard tools)
          o reasonable consistency, latency, efficiency
  • 25. Use Case 2: DBpedia Live Duplication
      • Average of 2 updates per second
      • Low latency desirable => need for a push technology
  • 26. ResourceSync - Agenda: 3. Framework Walkthrough
  • 27. Source Capability 1: Describing Content. In order to advertise the resources that a Source wants Destinations to know about, it may describe them:
      o Publish a Resource List, a list of resource URIs and possibly associated metadata
          - Destination GETs the Resource List
          - Destination GETs listed resources by their URI
      o A Resource List describes the state of a set of resources at one point in time (snapshot)
  • 30. Source Capability 2: Packaging Content. By default, content is transferred in response to a GET issued by a Destination against a URI of a Source's resource. But a Source may support additional mechanisms:
      o Publish a Resource Dump, a document that points to packages of resource representations and necessary metadata
          - Destination GETs the package
          - Destination unpacks the package
          - ZIP format supported
      o A Resource Dump and the packages it points to reflect the state of a set of resources at one point in time (snapshot)
  • 33. Source Capability 3: Describing Changes. In order to achieve lower latency and/or greater efficiency, a Source may communicate about changes to its resources:
      o Publish a Change List, a list of recent change events (created, updated, deleted resources)
          - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources
      o A Change List pertains to resources that changed in a temporal interval with a start and an end date
          - If a resource changed more than once, it will be listed more than once
  • 37. Source Capability 4: Packaging Changes. In order to reduce the number of requests to obtain resource changes, a Source may provide packaged bitstreams for changed resources:
      o Publish a Change Dump, a document that points to packages containing bitstreams of recently changed resources and necessary metadata
          - Destination GETs the package
          - Destination unpacks the package
          - ZIP format supported
      o A Change Dump and its packages pertain to resources that changed in a temporal interval with a start and an end date
          - If a resource changed more than once, it will be included more than once
  • 39. Destination: Key Processes
  • 40. ResourceSync - Agenda: 4. Framework (Technical) Details
  • 41. So Many Choices: Push, Pull, Crawl, DSNotify, OAI-PMH, rsync, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, Sitemap, SPARQLpush, SDShare, AtomPub, RSS, PubSubHubbub
  • 42. So Many Choices: Push, Pull, Crawl, DSNotify, OAI-PMH, rsync, OAI-ORE, RDFsync, WebDAV Col. Syn., XMPP, Atom, SWORD, Sitemap, SPARQLpush, SDShare, AtomPub, RSS, PubSubHubbub
  • 45. ResourceSync Sitemap Extensions
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln …/>
        <rs:md …/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:ln …/>
          <rs:md …/>
        </url>
        <url> … </url>
      </urlset>
  • 46. Related Resource Metadata Summary
      • Attributes of the <rs:ln> element; c.f. resource metadata + pri
      Element/Attribute   Description                                          Defined by
      <rs:ln>                                                                  ResourceSync
      encoding            HTTP Content-Encoding header value                   RFC2616
      hash                One or more content digests (md5, sha-1, sha-256)    Atom Link Ext.
      href                Related resource URI (identity)                      RFC4287
      length              HTTP Content-Length header value                     RFC4287
      modified            Timestamp of last change (c.f. <lastmod>)            Atom Link Ext.
      path                Path in ZIP package (Dump Manifests only)            ResourceSync
      pri                 Priority of link                                     RFC6249
      rel                 Relation - IANA registered or URI                    RFC4287
      type                HTTP Content-Type header value                       RFC4287
  • 47. Resource Metadata Summary
      Element/Attribute   Description                                              Defined by
      <loc>               Resource URI (identity)                                  sitemaps
      <lastmod>           Timestamp of last change                                 sitemaps
      <changefreq>        Expected update frequency                                sitemaps
      <rs:md>                                                                      ResourceSync
      change              Change type (Change List & Change Dump Manifest only)    ResourceSync
      encoding            HTTP Content-Encoding header value                       RFC2616
      hash                One or more content digests (md5, sha-1, sha-256)        Atom Link Ext.
      length              HTTP Content-Length header value                         RFC4287
      path                Path in ZIP package (Dump Manifests only)                ResourceSync
      type                HTTP Content-Type header value                           RFC4287
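As an illustration of how the hash, length and type attributes summarized above might be produced and checked, here is a minimal Python sketch. It assumes the `md5:<hex digest>` form shown in the slide examples and that multiple digests, if present, are separated by whitespace. The helper names `describe_content` and `matches` are hypothetical and not taken from the resync library.

```python
import hashlib


def describe_content(content, media_type):
    """Build illustrative values for the hash, length and type attributes."""
    return {
        "hash": "md5:" + hashlib.md5(content).hexdigest(),
        "length": str(len(content)),
        "type": media_type,
    }


def matches(content, md):
    """Check downloaded bytes against the length and md5 digest in md."""
    if "length" in md and int(md["length"]) != len(content):
        return False
    # Assumption: several digests may be listed, separated by whitespace.
    for digest in md.get("hash", "").split():
        algorithm, _, value = digest.partition(":")
        if algorithm == "md5":
            return hashlib.md5(content).hexdigest() == value
    return True  # no md5 digest available to check against


if __name__ == "__main__":
    md = describe_content(b"<html>res1</html>", "text/html")
    print(md)
    print(matches(b"<html>res1</html>", md))
```

A Destination could run a check like `matches` after each GET to decide whether the copy it received corresponds to the state the Source described.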
  • 48. Link Relation Summary
      Relation                   Use in ResourceSync                        Defined in
      rel="alternate"            Link from generic to specific URI          HTML 5
      rel="canonical"            Link from specific to generic URI          RFC6596
      rel="collection"           Resource is member of collection           RFC6573
      rel="contents"             Link from dump to manifest                 HTML4
      rel="describedby"          Has metadata                               Protocol for Web Description Resources (POWDER): Description Resources
      rel="describes"            Is metadata for                            The 'describes' Link Relation Type
      rel="duplicate"            Mirror or alternative copy                 RFC6249
      rel=".../rs/terms/patch"   A patch: efficient change information      This specification
      rel="memento"              Link to time-specific URI                  Memento Internet Draft
      rel="timegate"             Link to timegate                           Memento Internet Draft
      rel="via"                  Provenance chain, came from                RFC4287
  • 49. Resource List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist"
               at="2013-01-03T09:00:00Z"
               completed="2013-01-03T09:01:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  • 50. Resource List
      • Describes the Source's resources that are subject to synchronization
      • At one point in time (snapshot)
      • Creation can take some time; the duration can be conveyed
      • Typical Destination use: Baseline Synchronization, Audit
      • Each URI typically listed only once
      • Might be expensive to generate
      • Destinations use @at to determine freshness
      • [@at, @completed]: interval of uncertainty
      • Destination issues GETs against URIs to obtain resources
      • Very similar to current Sitemaps
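To make the Baseline Synchronization and Audit uses of a Resource List more concrete, the following Python sketch parses a Resource List with the standard library and compares it against a Destination's local record of hashes. It is a sketch under assumptions: the second `<url>` entry (res2) and its hash are invented for illustration, and `parse_resource_list` and `audit` are hypothetical helper names, not functions of the resync client.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# res1 follows the slide; res2 is a made-up second entry.
RESOURCE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e" length="14599" type="application/pdf"/>
  </url>
</urlset>
"""


def parse_resource_list(xml_text):
    """Return (at timestamp, {uri: {"lastmod": ..., "hash": ...}})."""
    root = ET.fromstring(xml_text)
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "resourcelist":
        raise ValueError("not a Resource List")
    resources = {}
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        item_md = url.find("rs:md", NS)
        resources[uri] = {
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": item_md.get("hash") if item_md is not None else None,
        }
    return md.get("at"), resources


def audit(local_hashes, remote):
    """Compare local {uri: hash} with a parsed Resource List.

    Returns (missing, stale, extra) as sorted lists of URIs.
    """
    local, listed = set(local_hashes), set(remote)
    missing = sorted(listed - local)
    extra = sorted(local - listed)
    stale = sorted(u for u in local & listed if local_hashes[u] != remote[u]["hash"])
    return missing, stale, extra


if __name__ == "__main__":
    at, resources = parse_resource_list(RESOURCE_LIST)
    print("snapshot taken at", at)
    print(audit({"http://example.com/res1": "md5:stale"}, resources))
```

For a baseline synchronization the Destination would GET every URI in `resources`; for an audit it would act only on the `missing`, `stale` and `extra` lists.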
  • 51. Resource Dump
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump" at="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/resourcedump_part1.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md length="97553" type="application/zip"/>
          <rs:ln rel="contents"
                 href="http://example.com/resourcedump_manifest-part1.xml"
                 type="application/xml"/>
        </url>
        <url>
          <loc>http://example.com/resourcedump_part2.zip</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </url>
      </urlset>
  • 52. Resource Dump Manifest
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcedump-manifest" at="2013-01-02T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md type="text/html" path="/resources/res1"/>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md type="application/pdf" path="/resources/res2"/>
        </url>
      </urlset>
  • 53. Resource Dump
      • A Resource Dump points to packages (ZIP files) that contain representations of the Source's resources
      • At one point in time (snapshot)
      • Resource Dump is mandatory, even if there is only one ZIP file
      • ZIP package contains manifest, listing contained bitstreams
      • Typical Destination use: Baseline Synchronization, bulk download
      • Each URI typically listed only once
      • Might be expensive to generate
      • Destinations use @at to determine freshness
      • [@at, @completed]: interval of uncertainty
      • GETs against individual URIs from the Resource List achieve the same result (ignoring varying freshness)
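The unpacking step for a Resource Dump package can be sketched as follows. The example builds a small ZIP in memory so that it runs without network access, then maps each resource URI to its bitstream via the manifest's path attributes. It assumes the manifest is stored in the package as `manifest.xml` and that a leading `/` in a path is relative to the package root; the file contents and the helper names `build_example_package` and `unpack` are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

MANIFEST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="text/html" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="application/pdf" path="/resources/res2"/>
  </url>
</urlset>
"""


def build_example_package():
    """Create a stand-in ZIP package (placeholder file contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("manifest.xml", MANIFEST)  # assumed manifest name
        package.writestr("resources/res1", b"<html>res1</html>")
        package.writestr("resources/res2", b"%PDF-1.4 placeholder")
    return buffer.getvalue()


def unpack(package_bytes):
    """Return {resource URI: bitstream} using the manifest's path attributes."""
    resources = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        manifest = ET.fromstring(package.read("manifest.xml"))
        for url in manifest.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            path = url.find("rs:md", NS).get("path")
            # ZIP member names carry no leading slash.
            resources[uri] = package.read(path.lstrip("/"))
    return resources


if __name__ == "__main__":
    for uri, data in unpack(build_example_package()).items():
        print(uri, len(data), "bytes")
```

In a real Destination the package bytes would come from a GET on a URL listed in the Resource Dump rather than from `build_example_package`.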
  • 54. Change List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="changelist"
               from="2013-01-02T09:00:00Z"
               until="2013-01-03T09:00:00Z"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md change="updated"
                 hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  • 55. Change List
      • A Change List pertains to a Source's resources that changed
      • Changes that occurred during a temporal interval with a start and an end date
      • Typical Destination use: Incremental Synchronization, Audit
      • Changes are listed in chronological order
      • Multiple changes to one resource result in the resource being listed multiple times, once per change
      • Source determines duration of temporal interval
      • Destinations use @from and @until to determine freshness
      • Destinations issue GETs against URIs to obtain changed resources
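Incremental synchronization from a Change List can be sketched as a replay of change events in document order. In the Python sketch below, only the res1 entry comes from the slide; the res2 and res3 entries are invented, `apply_change_list` is a hypothetical helper, and `fetch` is a caller-supplied stand-in for an HTTP GET so the example runs offline.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Only res1 is from the slide; the other entries are illustrative.
CHANGE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T20:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
</urlset>
"""


def apply_change_list(xml_text, state, fetch):
    """Replay change events in order against state ({uri: content}).

    Returns the list's until timestamp, which a Destination could store
    to know how far it has caught up.
    """
    root = ET.fromstring(xml_text)
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "changelist":
        raise ValueError("not a Change List")
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        change = url.find("rs:md", NS).get("change")
        if change in ("created", "updated"):
            state[uri] = fetch(uri)
        elif change == "deleted":
            state.pop(uri, None)
    return md.get("until")


if __name__ == "__main__":
    local = {"http://example.com/res1": "old", "http://example.com/res3": "old"}
    until = apply_change_list(CHANGE_LIST, local, lambda uri: "content of " + uri)
    print(until, sorted(local))
```

Because a resource that changed twice is listed twice, this naive replay fetches res2 twice; a Destination could instead act only on the last event per URI.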
  • 56. Discovery of Capabilities
      • Requirements:
          o Need to discover capabilities, i.e. Resource List, Resource Dump, Change List, Change Dump, Archives, Notification channels
          o Need to know the type of capability each document represents
      • Approach:
          o The Source publishes a Capability List that enumerates the capabilities it supports
          o By pointing at Resource List, Change List, Resource Dump, etc. using appropriate relation types, e.g. "resourcelist", "changelist", "resourcedump" etc.
      • http://www.openarchives.org/rs/resourcesync#CapabilityList
  • 57. Discovery of Capabilities
  • 58. Capability List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="capabilitylist"/>
        <url>
          <loc>http://example.com/dataset1/resourcelist.xml</loc>
          <rs:md capability="resourcelist"/>
        </url>
        <url>
          <loc>http://example.com/dataset1/changelist.xml</loc>
          <rs:md capability="changelist"/>
        </url>
        <url>
          <loc>http://example.com/dataset1/resourcedump.xml</loc>
          <rs:md capability="resourcedump"/>
        </url>
      </urlset>
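A Destination might turn the Capability List above into a simple lookup from capability name to document URI. The following Python sketch does this with the standard library; `parse_capability_list` is a hypothetical helper name, and the sketch assumes at most one document per capability.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CAPABILITY_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>
"""


def parse_capability_list(xml_text):
    """Return {capability name: document URI}."""
    root = ET.fromstring(xml_text)
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "capabilitylist":
        raise ValueError("not a Capability List")
    capabilities = {}
    for url in root.findall("sm:url", NS):
        item_md = url.find("rs:md", NS)
        if item_md is not None and item_md.get("capability"):
            # Assumption: one document per capability; later entries win.
            capabilities[item_md.get("capability")] = url.findtext(
                "sm:loc", namespaces=NS
            )
    return capabilities


if __name__ == "__main__":
    for name, uri in parse_capability_list(CAPABILITY_LIST).items():
        print(name, "->", uri)
```

With such a lookup a Destination can decide, for example, to use the Resource Dump for its initial load and the Change List afterwards.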
  • 59. Discovery of Capability Lists
      • Requirements:
          o Need to discover a Capability List
      • Approaches:
          o Introduce a link in the HTTP Link header of a resource that is subject to synchronization, pointing at the Capability List with the relation type "resourcesync"
          o Introduce a link from an HTML document that is subject to synchronization (<head> section), pointing at the Capability List with the relation type "resourcesync"
          o Link from a Resource List, etc. to the Capability List with the relation type "up"
      • Link header on example.com/res1.pdf:
          Link: <example.com/dataset1/capabilitylist.xml>;rel="resourcesync"
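The HTTP Link header approach can be illustrated with a small parser that looks for the "resourcesync" relation. This is a deliberately simplified sketch, not a complete Link header parser: it assumes that neither target URIs nor parameter values contain commas, and `find_capability_list` is a hypothetical helper name.

```python
import re


def find_capability_list(link_header):
    """Return the target of the first link with rel="resourcesync", else None."""
    # Each link-value looks like: <target>; param=value; ...
    for match in re.finditer(r'<([^>]*)>([^,]*)', link_header):
        target, params = match.groups()
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params)
        # rel may hold several space-separated relation types.
        if rel and "resourcesync" in rel.group(1).split():
            return target
    return None


if __name__ == "__main__":
    header = (
        '<http://example.com/res1.html>; rel="alternate", '
        '<http://example.com/dataset1/capabilitylist.xml>;rel="resourcesync"'
    )
    print(find_capability_list(header))
```

A Destination would read the header value from the response to a HEAD or GET on a resource and then fetch the returned URI as the Capability List.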
  • 60. Discovery via robots.txt
      • Resource Lists are (enhanced) Sitemaps
      • Sitemaps can be discovered via robots.txt
      • Ergo, Resource Lists should be discoverable via robots.txt
      User-agent: *
      Disallow: /cgi-bin/
      Disallow: /tmp/
      Sitemap: http://example.com/dataset1/resourcelist.xml
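Discovery via robots.txt amounts to collecting the Sitemap entries from the file. A minimal Python sketch follows; `sitemap_urls` is a hypothetical helper name. Note that robots.txt alone does not say which of the listed Sitemaps are Resource Lists, so a Destination would still need to fetch each document and check for an `<rs:md capability="resourcelist">` element.

```python
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
"""


def sitemap_urls(robots_txt):
    """Return the URLs of all Sitemap entries, in order of appearance."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own colons survive.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_TXT))
```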
  • 62. Motivation for Notifications
      • Reduce synchronization latency by having the Source push out resource change information
          o To avoid continuous pull of Change Lists by Destinations
      • Share information about changes to the Source's ResourceSync implementation, e.g. announcement of new Resource List, new Capability List, etc.
          o To avoid continuous polling of e.g. Resource Lists, ResourceSync Description
  • 63. Source: Notification Capabilities (PUSH)
      1. Change Notification
          o Notifies about changes to particular resources
          o e.g., resource A has been updated | created | deleted
      2. Framework Notification
          o Notifies about changes to capabilities, i.e., their documents
          o e.g., a Change List has been updated | created | deleted
          o Also for Capability Lists and Source Description
  • 64. Notification Channels
      • Notifications are sent via channels
          o Resource Notification: one channel per set of resources
          o Framework Notification: one channel per set of resources
          o Sent on level of capability document, not on index-level
          o Notifications about changes to Source Description sent on all Framework Notification channels
      • Payload for notifications: <urlset> documents
      • Transport protocol for notifications under discussion:
          o PubSubHubbub - https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html - current choice
          o WebSockets - http://tools.ietf.org/html/rfc6455 - may be added later
  • 65. Change Notification Payload
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:ln rel="up"
               href="http://example.com/dataset1/capabilitylist.xml"/>
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T09:07:00Z</lastmod>
          <rs:md change="updated"
                 hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
                 length="8876"
                 type="text/html"/>
        </url>
        <url> … </url>
      </urlset>
  • 66. ResourceSync - Agenda: 5. Implementation
  • 67. DSpace support for metadata harvesting use case
      • DSpace Module: https://github.com/CottageLabs/DSpaceResourceSync
      • PHP client: https://github.com/stuartlewis/resync-php
      • Example resource URI: http://mydspace.edu/dspace-rs/resource/123456789/7/qdc (dspace-rs: ResourceSync webapp; 123456789/7: Item handle; qdc: Metadata Format)
  • 68. ResourceSync @ arXiv
      • Use ResourceSync for both mirroring and public data access
          o efficient updates
          o ability to do periodic audits
          o public synchronization capability
          o reduce admin burden
      • Start with metadata + source for mirroring use case (doing experiments now)
      • Open Access use cases require processed PDF also
  • 69. Getting a copy of arXiv. It might be as easy as: (of course, you probably have to wait a while, but it is nice to know ResourceSync is stateless so one can efficiently restart)
  • 70. Python Library and Client
      • Aim to provide library code implementing all ResourceSync facilities for use in both Source and Destination implementations
          o Designed for Python 2.6 (RHEL6) and 2.7
      • Client (resync) supports many Destination operations, inspired by the common Unix rsync program
      • Client also supports some operations that might be useful in a Source, such as generation of static Resource Lists, or periodic Change Lists (used in arXiv experiments)
      • Explorer (resync-explorer) intended to allow easy inspection of a Source's resource sets and capabilities
      • Developed since ResourceSync v0.5, updated for v0.9.1
      • http://github.com/resync/resync
      • On pypi: "easy_install resync"
  • 71. ResourceSync Source Simulator
      • Python code using Tornado server
      • Provides random set of resources of different sizes updated at a particular rate
      • Very useful for testing Destination code
      • http://github.com/resync/simulator
  • 72. ResourceSync - Agenda: 6. Q&A
  • 73. ResourceSync: A Web-Based Resource Synchronization Framework #resourcesync ResourceSync is funded by The Sloan Foundation & JISC
  • 74. THANK YOU. We look forward to seeing you at a future NISO training event.

Editor's Notes

  • #16: LANL Memento Aggregator of IIPC; Europeana does metadata via OAI-PMH but anticipates content also; arXiv: mirroring and data sharing; Linked data @ BBC; DBpedia; journal data at LANL. REST not about in 1999
  • #26: Semantic web version of Wikipedia; want mirror to provide reliable basis for local services
  • #40: Top line: just metadata about resources, destination uses GET to get them (duh). Bottom line: packaged content => fewer round trips
  • #42: Rsync etc. just reference; push vs pull -> both; many other parts
  • #43: Rsync etc. just reference; push vs pull -> both; many other parts
  • #49: Add: rel="contents", rel="archives"
  • #69: Test site, has subsets of arXiv and even complete source plus metadata (at present not up to date with 0.9)
  • #70: No way around the difficulty of transferring 1TB initially but then a daily or weekly sync is efficient, and it still works even after some arbitrary time.