Data curation and data archiving at different stages of the research process

Henk van den Berg, Jerry de Vries, Andrea Scharnhorst (DANS)
DANS Colloquium, March 21, 2019

KDP - Dutch women authors
and their Dutch readers
(until 1900)
Pipeline
Timbuctoo -> EASY
An exploration
This presentation
Data cleaning/curation

• Multilayered application
• Multiple compartmented RDF datasets
• Multiple extendable interfaces
• RESTful API
• GraphQL
• (external) Web UI
• …
• Users/Groups can
• interact with one or more datasets
• Upload/convert datasets, e.g. Excel sheets
• …
• Instances can exchange datasets via ResourceSync
A technical view

Ingest of data
Manual deposit
(individual)
User
(Organization)
EASY SWORD v2.0
Automated deposit
Web form
Collective agreement & coordination
over multiple datasets:
• License Agreement and Terms of Use
• Intellectual property rights
• Personal data (removed)
• (Sufficient) metadata
• File formats
Ingest flow

Timbuctoo EASY
ResourceSync
Swordv2
Who is at the wheel?
What is inside?
A pipeline?
What is needed on the
supplier side?
What are the implications
for the archive?

RDF datasets
• RDF datasets package up zero or more named RDF graphs along with a single unnamed,
default RDF graph.
• The graphs in a single dataset may share blank nodes.
https://www.w3.org/TR/rdf11-mt/#rdf-datasets
RDF Graph
• An RDF graph is a set of RDF triples.
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-rdf-graph
An RDF triple consists of three components:
• the subject, which is an IRI or a blank node
• the predicate, which is an IRI
• the object, which is an IRI, a literal or a blank node
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-rdf-triple
RDF datasets, graphs, triples and quads
<s> <p> <o> <g>
A quad
<s> <p> <o>
A triple
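
The relation between triples, quads, graphs and datasets can be sketched in a few lines of plain Python. This is only an illustration of the definitions above, not how RDF4J or rdflib store data; the names Triple, Quad and group_into_dataset and all example.org IRIs are made up for this sketch, and terms are kept as plain strings.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # an IRI or a blank node label
    predicate: str  # an IRI
    obj: str        # an IRI, a literal or a blank node label


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str]  # graph name; None stands for the unnamed default graph


def group_into_dataset(quads):
    """Group quads into {graph name: set of triples}; the key None is the default graph."""
    dataset = defaultdict(set)
    dataset[None]  # touch the key so the default graph exists even when it is empty
    for s, p, o, g in quads:
        dataset[g].add(Triple(s, p, o))
    return dict(dataset)


# Hypothetical example data: one triple in the default graph, two in a named graph.
quads = [
    Quad("http://example.org/author/1", "http://schema.org/name",
         '"A. Author"', None),
    Quad("http://example.org/work/1", "http://schema.org/author",
         "http://example.org/author/1", "http://example.org/graph/works"),
    Quad("http://example.org/work/1", "http://schema.org/name",
         '"A Title"', "http://example.org/graph/works"),
]
dataset = group_into_dataset(quads)
for name, graph in dataset.items():
    print(name or "(default graph)", len(graph))
```

Dropping the fourth component of a quad yields a triple; grouping quads by that fourth component yields the named graphs of the dataset.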

Shoving in the boxes
Manifestations of RDF datasets don’t
necessarily represent workable units
Libraries that work with RDF:
• RDF4J (pka Sesame) - Java
• rdflib - Python
• org.eclipse.rdf4j.repository.Repository
• rdflib.graph.ConjunctiveGraph
Backed by memory/file system/database
• Need to handle a dataset as a unit,
irrespective of how large it is
• Streamline interactions with RDF datasets
• verify
• compare
• programmatic access
• query (SPARQL)
• reasoning
• split
• ...

What is inside? – verifying datasets
Load dataset into Repository -> Valid RDF
Syntax (easy)
Semantics (difficult)
1. Are predicate and object URIs resolvable?
• Do they point to an external RDF definition of the term?
• If so: are syntax and semantics of this RDF-definition correct?
• Start at 1 ...
• Is it a web page?
• Has it anything to do with this term?
2. Do predicate and object URIs point to an internal definition?
• Start at 1 ...
Is this approach at all feasible?
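
The first step of the semantic check (is a URI resolvable, and does it lead to an RDF definition or to a web page?) could look roughly like the sketch below. It uses only the Python standard library. The function names, the list of RDF media types and the result labels are assumptions made for this illustration; it does not attempt the recursive check of the definition itself, nor the question whether a page has anything to do with the term.

```python
import urllib.error
import urllib.request

# Media types treated as "an RDF definition" in this sketch (an assumed, incomplete list).
RDF_TYPES = (
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
)
HTML_TYPES = ("text/html", "application/xhtml+xml")


def fetch_content_type(iri, timeout=10):
    """Return the Content-Type of the response for `iri`, or None if it cannot be fetched."""
    request = urllib.request.Request(iri, headers={"Accept": ", ".join(RDF_TYPES)})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get_content_type()
    except (urllib.error.URLError, ValueError, OSError):
        return None


def classify_iri(iri, fetch=fetch_content_type):
    """Classify what a predicate or object URI resolves to."""
    content_type = fetch(iri)
    if content_type is None:
        return "unresolvable"
    if content_type in RDF_TYPES:
        return "rdf definition"
    if content_type in HTML_TYPES:
        return "web page"
    return "other"
```

Calling classify_iri on every distinct predicate and object URI of a dataset gives a first impression of how much of the vocabulary can be followed at all. The fetch parameter exists so that the classification can be tried out without network access.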

What is inside? – fingerprints
Quality assessment of a dataset:
Can we gather relevant information on one A4 page so that an archivist can make a reasonable guess?
fingerprint
~/work/ww/data/ww_query_result_20190212.xml
contexts : 1
statements: 184.263
literals : 114.884
literal ratio: 0,62
subject netlocs: 194 {'resource.huygens.knaw.nl': 178815, '': 3912,
'www.dbnl.org': 445,...
object netlocs: 194 {'resource.huygens.knaw.nl': 66052, '': 1745...
example.org | s 0 | p 0 | o 0 | total 0
predicate netlocs: 5 | {'resource.huygens.knaw.nl': 115766, 'www.w3.org':
22088, 'purl.org': 26859, 'schema.org': 15640, 'www.purl.org': 3910}
predicates: 76
http://www.w3.org/2004/02/skos/core#note: 6973
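
A fingerprint of this kind could be computed along the following lines. This is a sketch in plain Python, not the tool that produced the output above. It assumes that triples arrive as (subject, predicate, object, object_is_literal) tuples of strings, that literals are left out of the object netloc count, and that terms without a network location (for instance blank nodes) are counted under the empty string ''. The function names and the sample data are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def netloc(term):
    """Network location of an IRI; '' for terms without one (e.g. blank nodes)."""
    return urlsplit(term).netloc


def fingerprint(triples):
    """Summarise an iterable of (subject, predicate, object, object_is_literal) tuples."""
    statements = literals = 0
    subject_netlocs, predicate_netlocs, object_netlocs = Counter(), Counter(), Counter()
    predicates = Counter()
    for s, p, o, o_is_literal in triples:
        statements += 1
        subject_netlocs[netloc(s)] += 1
        predicate_netlocs[netloc(p)] += 1
        predicates[p] += 1
        if o_is_literal:
            literals += 1
        else:
            # Assumption of this sketch: only non-literal objects have a netloc.
            object_netlocs[netloc(o)] += 1
    return {
        "statements": statements,
        "literals": literals,
        "literal ratio": literals / statements if statements else 0.0,
        "subject netlocs": subject_netlocs,
        "predicate netlocs": predicate_netlocs,
        "object netlocs": object_netlocs,
        "predicates": predicates,
    }


# Hypothetical sample data.
triples = [
    ("http://example.org/person/1", "http://schema.org/name", "Jane", True),
    ("http://example.org/person/1", "http://www.w3.org/2004/02/skos/core#note", "a note", True),
    ("http://example.org/person/1", "http://schema.org/knows", "http://example.org/person/2", False),
    ("_:b0", "http://schema.org/name", "http://example.net/x", False),
]
fp = fingerprint(triples)
for key, value in fp.items():
    print(f"{key}: {value}")
```

Because the function consumes an iterable one statement at a time, it can run over a dataset that does not fit in memory, as long as the number of distinct predicates and netlocs stays small.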

What is inside? – Public datasets
Two instances of Timbuctoo – public datasets:
ANANSI: https://data.anansi.clariah.nl
Huygens data: https://repository.huygens.knaw.nl

Timbuctoo EASY
ResourceSync
Swordv2
Who is at the wheel?
What is inside?
A pipeline?
What is needed on the
supplier side?
What are the implications
for the archive?
Legal agreements
Coordination
Responsibility
Organizational structure
• authentication - single sign-on/handshake
• metadata quality, completeness / feedback
• data structural integrity, personal data removed / feedback
• legal framework
• license
• terms of use

A digital archive guards
the technical and cognitive interpretability
of its assets over time
Technical Cognitive
Representation
Machine Human
Context
Context
Things
and
phenomena
RDF
datasets
Machine
The Web
<…./>
Archiving RDF datasets
