Data curation and data archiving
at different stages of the research
process
Henk van den Berg, Jerry de Vries, Andrea Scharnhorst (DANS)
DANS Colloquium, March 21, 2019
KDP - Dutch women authors
and their Dutch readers
(until 1900)
Pipeline
Timbuctoo -> Easy
An exploration
This presentation
Data cleaning/curation
• Multilayered application
• Multiple compartmented RDF datasets
• Multiple extendable interfaces
• RESTful API
• GraphQL
• (external) Web UI
• …
• Users/Groups can
• interact with one or more datasets
• Upload/convert datasets, e.g. Excel sheets
• …
• Instances can exchange datasets via ResourceSync
A technical view
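As an illustration of what exchanging datasets via ResourceSync involves on the receiving side, here is a minimal sketch that reads a ResourceSync resource list (a sitemap document with `rs:md` metadata). The sample document, the example.org URL and the names `Resource` and `parse_resource_list` are assumptions made for this sketch. They are not taken from Timbuctoo, whose actual output may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional

# Namespaces used by sitemap documents and the ResourceSync extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


class Resource(NamedTuple):
    loc: str
    lastmod: Optional[str]
    digest: Optional[str]  # e.g. "md5:...", taken from the rs:md hash attribute


def parse_resource_list(xml_text: str) -> List[Resource]:
    """Extract location, last modification and hash of every listed resource."""
    root = ET.fromstring(xml_text)
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        resources.append(
            Resource(
                loc=url.findtext("sm:loc", default="", namespaces=NS).strip(),
                lastmod=url.findtext("sm:lastmod", namespaces=NS),
                digest=md.get("hash") if md is not None else None,
            )
        )
    return resources


# Hypothetical resource list with a single N-Quads file.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2019-03-21T09:00:00Z"/>
  <url>
    <loc>http://example.org/dataset/part1.nq</loc>
    <lastmod>2019-03-20T12:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876"
           type="application/n-quads"/>
  </url>
</urlset>
"""

resources = parse_resource_list(SAMPLE)
```

A destination would typically download each `loc` and compare the stated hash with the received bytes before accepting the dataset.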
Ingest of data
Manual deposit
(individual)
User
(Organization)
EASY SWORD v2.0
Automated deposit
Web form
Collective agreement & coordination
over multiple datasets:
• License Agreement and Terms of Use
• Intellectual property rights
• Personal data (removed)
• (Sufficient) metadata
• File formats
Ingest flow
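For the automated deposit route, a client has to describe its package in HTTP headers. The sketch below only assembles those headers and sends nothing. The BagIt packaging identifier, the hex-encoded MD5 and the function name `deposit_headers` are assumptions for illustration; the archive's own SWORD v2 documentation is authoritative.

```python
import hashlib
from typing import Dict

# Assumed packaging identifier for zipped bags; check the archive's documentation.
BAGIT_PACKAGING = "http://purl.org/net/sword/package/BagIt"


def deposit_headers(payload: bytes, filename: str, in_progress: bool = False) -> Dict[str, str]:
    """Build the HTTP headers that accompany a zipped deposit."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Checksum so the server can verify that the payload arrived intact.
        "Content-MD5": hashlib.md5(payload).hexdigest(),
        "Packaging": BAGIT_PACKAGING,
        # "true" signals that more parts of the same deposit will follow.
        "In-Progress": "true" if in_progress else "false",
    }


headers = deposit_headers(b"zip bytes would go here", "dataset.zip")
```

Everything on the agreement list (licence, rights, personal data, metadata, file formats) has to be settled before such a request is sent, because the web form that enforces it for manual deposits is bypassed.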
Timbuctoo EASY
ResourceSync
Swordv2
Who is at the wheel?
What is inside?
A pipeline?
What is needed on the
supplier side?
What are the implications
for the archive?
RDF datasets
• RDF datasets package up zero or more named RDF graphs along with a single unnamed,
default RDF graph.
• The graphs in a single dataset may share blank nodes.
https://www.w3.org/TR/rdf11-mt/#rdf-datasets
RDF Graph
• An RDF graph is a set of RDF triples.
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-rdf-graph
An RDF triple consists of three components:
• the subject, which is an IRI or a blank node
• the predicate, which is an IRI
• the object, which is an IRI, a literal or a blank node
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-rdf-triple
RDF datasets, graphs, triples and quads
<s> <p> <o> <g>
A quad
<s> <p> <o>
A triple
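The relation between datasets, graphs, triples and quads can be sketched in a few lines of plain Python. This is a simplified model for illustration only: terms are plain strings, and the names `Triple`, `Quad` and `Dataset` are chosen for this sketch rather than taken from RDF4J or rdflib.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str    # IRI or blank node
    predicate: str  # IRI
    obj: str        # IRI, literal or blank node


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str]  # graph name; None stands for the default graph


@dataclass
class Dataset:
    """One unnamed default graph plus zero or more named graphs."""
    default_graph: Set[Triple] = field(default_factory=set)
    named_graphs: Dict[str, Set[Triple]] = field(default_factory=dict)

    def add(self, quad: Quad) -> None:
        # A quad is a triple plus the name of the graph it belongs to.
        triple = Triple(quad.subject, quad.predicate, quad.obj)
        if quad.graph is None:
            self.default_graph.add(triple)
        else:
            self.named_graphs.setdefault(quad.graph, set()).add(triple)

    def quads(self) -> Iterator[Quad]:
        for triple in self.default_graph:
            yield Quad(*triple, None)
        for name, graph in self.named_graphs.items():
            for triple in graph:
                yield Quad(*triple, name)


ds = Dataset()
ds.add(Quad("http://example.org/s", "http://example.org/p",
            "http://example.org/o", "http://example.org/g"))
ds.add(Quad("http://example.org/s", "http://example.org/p", '"a literal"', None))
```

Because each graph is a set, adding the same triple twice to the same graph has no effect, which matches the definition of an RDF graph as a set of triples.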
Shoving in the boxes
Manifestations of RDF datasets don’t
necessarily represent workable units
Libraries that work with RDF:
• RDF4J (formerly Sesame) - Java
• rdflib - Python
• org.eclipse.rdf4j.repository.Repository
• rdflib.graph.ConjunctiveGraph
Backed by memory/file system/database
• Need to handle a dataset as a unit,
irrespective of how large it is
• Streamline interactions with RDF datasets
• verify
• compare
• programmatic access
• query (SPARQL)
• reasoning
• split
• ...
What is inside? – verifying datasets
Load dataset into Repository -> Valid RDF
Syntax (easy)
Semantics (difficult)
1. Are predicate and object URIs resolvable?
• Do they point to an external RDF-definition of the term?
• If so: are syntax and semantics of this RDF-definition correct?
• Start at 1 ...
• Is it a web page?
• Has it anything to do with this term?
2. Do predicate and object URIs point to an internal definition?
• Start at 1 ...
Is this approach at all feasible?
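The first step of this check, finding out whether a URI resolves and what kind of document comes back, could look roughly like the sketch below. It covers only that first step and says nothing about whether the returned definition is semantically correct. The list of RDF media types, the category labels and the function names are assumptions of this sketch.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Media types treated as "an RDF definition" in this sketch (not exhaustive).
RDF_TYPES = (
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
)
ACCEPT = ", ".join(RDF_TYPES) + ", text/html;q=0.5"


def fetch_content_type(uri: str, timeout: float = 10.0) -> Optional[str]:
    """Return the media type served for uri, or None if it does not resolve."""
    request = urllib.request.Request(uri, headers={"Accept": ACCEPT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get_content_type()
    except (urllib.error.URLError, ValueError, OSError):
        return None


def classify(uri: str, fetch: Callable[[str], Optional[str]] = fetch_content_type) -> str:
    """Sort a URI into one of four rough categories based on what it serves."""
    content_type = fetch(uri)
    if content_type is None:
        return "unresolvable"
    if content_type in RDF_TYPES:
        return "rdf definition"
    if content_type in ("text/html", "application/xhtml+xml"):
        return "web page"
    return "other"
```

The `fetch` parameter makes it possible to run the classification against recorded responses instead of the live web, which matters when a dataset contains thousands of distinct URIs.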
What is inside? – fingerprints
Quality assessment of a dataset:
Can we gather relevant information on one A4 page so that an archivist can make a reasonable guess?
fingerprint
~/work/ww/data/ww_query_result_20190212.xml
contexts : 1
statements: 184.263
literals : 114.884
literal ratio: 0,62
subject netlocs: 194 {'resource.huygens.knaw.nl': 178815, '': 3912,
'www.dbnl.org': 445,...
object netlocs: 194 {'resource.huygens.knaw.nl': 66052, '': 1745...
example.org | s 0 | p 0 | o 0 | total 0
predicate netlocs: 5 | {'resource.huygens.knaw.nl': 115766, 'www.w3.org':
22088, 'purl.org': 26859, 'schema.org': 15640, 'www.purl.org': 3910}
predicates: 76
http://www.w3.org/2004/02/skos/core#note: 6973
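A fingerprint of this kind can be computed in a single pass over the statements. The following is a minimal sketch and not the tool that produced the output above: the `Term` type, the dictionary keys and the three sample triples are assumptions, and literals are left out of the object netloc count.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple
from urllib.parse import urlsplit


class Term(NamedTuple):
    value: str
    is_literal: bool = False


def netloc(term: Term) -> str:
    """Network location of an IRI; '' for blank nodes and relative IRIs."""
    return urlsplit(term.value).netloc


def fingerprint(triples: Iterable[Tuple[Term, Term, Term]]) -> Dict[str, object]:
    statements = 0
    literals = 0
    subject_netlocs: Counter = Counter()
    predicate_netlocs: Counter = Counter()
    object_netlocs: Counter = Counter()
    predicates: Counter = Counter()
    for s, p, o in triples:
        statements += 1
        subject_netlocs[netloc(s)] += 1
        predicate_netlocs[netloc(p)] += 1
        predicates[p.value] += 1
        if o.is_literal:
            literals += 1
        else:
            object_netlocs[netloc(o)] += 1
    return {
        "statements": statements,
        "literals": literals,
        "literal_ratio": round(literals / statements, 2) if statements else 0.0,
        "subject_netlocs": dict(subject_netlocs),
        "predicate_netlocs": dict(predicate_netlocs),
        "object_netlocs": dict(object_netlocs),
        "predicates": len(predicates),
    }


# Hypothetical three-statement sample.
sample = [
    (Term("http://resource.huygens.knaw.nl/a"),
     Term("http://www.w3.org/2004/02/skos/core#note"),
     Term("a note", True)),
    (Term("http://resource.huygens.knaw.nl/a"),
     Term("http://schema.org/name"),
     Term("A name", True)),
    (Term("_:b0"),
     Term("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
     Term("http://schema.org/Person")),
]
fp = fingerprint(sample)
```

Because the function consumes an iterable, it can be fed from a streaming parser, so the dataset does not have to fit in memory to be fingerprinted.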
What is inside? – Public datasets
Two instances of Timbuctoo – public datasets:
ANANSI: https://data.anansi.clariah.nl
Huygens data: https://repository.huygens.knaw.nl
Timbuctoo EASY
ResourceSync
Swordv2
Who is at the wheel?
What is inside?
A pipeline?
What is needed on the
supplier side?
What are the implications
for the archive?
Legal agreements
Coordination
Responsibility
Organizational structure
• authentication - single sign-on/handshake
• metadata quality, completeness / feedback
• data structural integrity, personal data removed / feedback
• legal framework
• license
• terms of use
A digital archive guards
the technical and cognitive interpretability
of its assets over time
Technical Cognitive
Representation
Machine Human
Context
Context
Things
and
phenomena
RDF
datasets
Machine
The Web
<…./>
Archiving RDF datasets
