Enabling Precise Identification and
Citability of Dynamic Data:
Recommendations of the RDA Working
Group WGDC
Andreas Rauber
Technical University of Vienna
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi
Outline
 Challenges in Data Identification and Citation
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
Data Citation
Why do we want to precisely identify
and cite data?
 Citation to give and accumulate credit for the
- Creator of the dataset
- Datacenter hosting the data
- Funder providing funding
- …
 Citation to assist in research
- Reproduce a study
- Compare different models
- Automate meta-studies
- …
Data Citation
 Citing data may seem easy
- from providing a URL in a footnote
- via providing a reference in the bibliography section
- to assigning a PID (DOI, ARK, …) to dataset in a repository
 What’s the problem?
Granularity of Data Identification / Citation
 What about the granularity of data to be identified/cited?
- Databases collect enormous amounts of data over time
- Researchers use specific subsets of data
- Need to identify precisely the subset used
 Current approaches
- Citing entire dataset, providing textual description of subset
-> imprecise (ambiguity)
- Storing a copy of subset as used in study -> scalability
- Storing list of record identifiers in subset -> scalability,
not for arbitrary subsets (e.g. when not entire record selected)
 Would like to be able to identify & cite precisely the
subset of (dynamic) data used in a study
Citation of Dynamic Data
 Citable datasets have to be static
- Fixed set of data, no changes:
no corrections to errors, no new data being added
 But: (research) data is dynamic
- Adding new data, correcting errors, enhancing data quality, …
- Changes sometimes highly dynamic, at irregular intervals
 Current approaches
- Identifying entire data stream, without any versioning
- Using “accessed at” date
- “Artificial” versioning by identifying batches of data (e.g.
annual), aggregating changes into releases (time-delayed!)
 Would like to cite precisely the data as it existed at certain
point in time, without delaying release of new data
Data Citation – Requirements
 Dynamic data
- corrections, additions, …
 Arbitrary subsets of data (granularity)
- rows/columns, time sequences, …
- from single number to the entire set
 Stable across technology changes
- e.g. migration to new database
 Machine-actionable
- not just machine-readable,
definitely not just human-readable and interpretable
 Scalable to very large / highly dynamic datasets
- But: should also work for small and/or static datasets!
 Research Data Alliance
 WG on Data Citation:
Making Dynamic Data Citeable
 March 2014 – Sep 2015
- Concentrating on the problems of
large, dynamic (changing) datasets
- Focus! Identification of data!
Not: PID systems, metadata, citation string, attribution, …
- Liaise with other WGs and initiatives on data citation
(CODATA, DataCite, Force11, …)
- Continuing support for adoption
https://rd-alliance.org/working-groups/data-citation-wg.html
RDA WG Data Citation
Data Citation – Output
 14 Recommendations
grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific data sets
- Resolving PIDs
- Upon modifications to the data
infrastructure
 2-page flyer
 Technical Report: draft at
https://rd-alliance.org/system/files/documents/
RDA-Guidelines_TCDL_draft.pdf
 Reference implementations
(SQL, CSV, XML) and Pilots
Outline
 Recap: Challenges addressed by the WG
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
Making Dynamic Data Citeable
Data Citation: Data + Means-of-access
 Data  time-stamped & versioned (aka history)
Researcher creates working-set via some interface:
 Access  assign PID to QUERY, enhanced with
 Time-stamping for re-execution against versioned DB
 Re-writing for normalization, unique-sort, mapping to history
 Hashing result-set: verifying identity/correctness
leading to landing page
S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation.
In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013
http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf
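The model above can be sketched in a few lines of Python. This is a toy, in-memory illustration; the record layout, timestamps, and hash scheme are invented for the example, not taken from the reference implementation:

```python
import hashlib

# Versioned records: each carries valid_from/valid_to timestamps (None = still current).
records = [
    {"id": 1, "value": 10, "valid_from": 100, "valid_to": None},
    {"id": 2, "value": 20, "valid_from": 100, "valid_to": 200},   # superseded at t=200
    {"id": 2, "value": 25, "valid_from": 200, "valid_to": None},  # corrected version
]

def execute(query_ts):
    """Re-execute the 'query': select records as they existed at query_ts, stably sorted."""
    rows = [r for r in records
            if r["valid_from"] <= query_ts
            and (r["valid_to"] is None or r["valid_to"] > query_ts)]
    return sorted(rows, key=lambda r: r["id"])  # unique, stable sort

def result_hash(rows):
    """Fixity information computed over the sorted result set."""
    data = ";".join(f"{r['id']}:{r['value']}" for r in rows)
    return hashlib.sha256(data.encode()).hexdigest()

# Citing: store the query's timestamp and result hash under a PID.
query_store = {}
ts = 150
query_store["pid-0001"] = {"timestamp": ts, "hash": result_hash(execute(ts))}

# Later re-execution yields the identical subset, verifiable via the hash.
assert result_hash(execute(query_store["pid-0001"]["timestamp"])) \
    == query_store["pid-0001"]["hash"]
```

Note that only the query, timestamp, and hash are stored, not the data subset itself; the subset is reproduced on demand from the versioned store.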
Data Citation – Deployment
 Researcher uses workbench to identify subset of data
 Upon executing selection („download“) user gets
 Data (package, access API, …)
 PID (e.g. DOI) (Query is time-stamped and stored)
 Hash value computed over the data for local storage
 Recommended citation text (e.g. BibTeX)
 PID resolves to landing page
 Provides detailed metadata, link to parent data set, subset,…
 Option to retrieve original data OR current version OR changes
 Upon activating PID associated with a data citation
 Query is re-executed against time-stamped and versioned DB
 Results as above are returned
 Query store aggregates data usage
Note: the query string provides excellent
provenance information on the data set!
This is an important advantage over
traditional approaches relying on, e.g.,
storing a list of identifiers or a DB dump.
Identify which parts of the data are used;
if data changes, identify which queries
(studies) are affected.
Data Citation – Recommendations
Preparing Data & Query Store
- R1 – Data Versioning
- R2 – Timestamping
- R3 – Query Store
When Data should be persisted
- R4 – Query Uniqueness
- R5 – Stable Sorting
- R6 – Result Set Verification
- R7 – Query Timestamping
- R8 – Query PID
- R9 – Store Query
- R10 – Citation Text
When Resolving a PID
- R11 – Landing Page
- R12 – Machine Actionability
Upon Modifications to the
Data Infrastructure
- R13 – Technology Migration
- R14 – Migration Verification
Data Citation – Recommendations
A) Preparing the Data and the Query Store
 R1 – Data Versioning: Apply versioning to ensure that earlier
states of the data can be retrieved
 R2 – Timestamping: Ensure that operations on data are
timestamped, i.e. any additions, deletions are marked with a
timestamp
 R3 – Query Store: Provide means to store the queries and
metadata to re-execute them in the future
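A minimal sketch of R1/R2 using SQLite. Table and column names are hypothetical; this is the "integrated" layout in which the original table is extended with temporal metadata and a record-version column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrated versioning (R1/R2): the original table is extended with temporal
# metadata; the primary key is expanded by a record-version column.
conn.execute("""
    CREATE TABLE readings (
        sensor_id   INTEGER,
        version     INTEGER,
        value       REAL,
        valid_from  REAL NOT NULL,
        valid_to    REAL,             -- NULL = current version
        PRIMARY KEY (sensor_id, version)
    )
""")

def upsert(sensor_id, value, ts):
    """Insert a new version; close the validity interval of the previous one."""
    prev = conn.execute(
        "SELECT MAX(version) FROM readings WHERE sensor_id = ?",
        (sensor_id,)).fetchone()[0]
    if prev is not None:
        conn.execute(
            "UPDATE readings SET valid_to = ? WHERE sensor_id = ? AND version = ?",
            (ts, sensor_id, prev))
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?, NULL)",
        (sensor_id, (prev or 0) + 1, value, ts))

upsert(1, 10.5, ts=100.0)
upsert(1, 10.7, ts=200.0)   # correction: the old version is kept, its interval closed
```

A query restricted to `valid_from <= t AND (valid_to IS NULL OR valid_to > t)` then returns the data exactly as it existed at time `t`.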
Data Citation – Recommendations
B) Persistently Identify Specific Data sets (1/2)
When a data set should be persisted:
 R4 – Query Uniqueness: Re-write the query to a normalized form
so that identical queries can be detected. Compute a checksum of
the normalized query to efficiently detect identical queries
 R5 – Stable Sorting: Ensure an unambiguous sorting of the
records in the data set
 R6 – Result Set Verification: Compute fixity information/checksum
of the query result set to enable verification of the correctness of a
result upon re-execution
 R7 – Query Timestamping: Assign a timestamp to the query
based on the last update to the entire database (or the last update
to the selection of data affected by the query or the query execution
time). This allows retrieving the data as it existed at query time
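R4 through R7 can be illustrated as follows. The normalization here is deliberately naive (lowercasing and collapsing whitespace); real systems rewrite the parsed query, and the field names in the stored record are invented for the example:

```python
import hashlib
import re

def normalize(query):
    """R4: rewrite to a normalized form so identical queries can be detected
    (toy normalization: collapse whitespace, lowercase)."""
    return re.sub(r"\s+", " ", query.strip()).lower()

def checksum(text):
    return hashlib.sha256(text.encode()).hexdigest()

q1 = "SELECT value FROM readings   WHERE sensor_id = 1"
q2 = "select value from readings where sensor_id = 1"
# Identical queries are detected efficiently via the checksum of the normalized form.
assert checksum(normalize(q1)) == checksum(normalize(q2))

# R5/R6: fixity information over the unambiguously sorted result set.
result = [(2, 20.0), (1, 10.5)]
result_checksum = checksum(repr(sorted(result)))

# R7: timestamp the query, e.g. with the last update to the database.
query_record = {"query": normalize(q1),
                "query_hash": checksum(normalize(q1)),
                "result_hash": result_checksum,
                "timestamp": 200.0}
```

The stable sort in R5 matters: without it, re-execution could return the same rows in a different order and the fixity check of R6 would fail spuriously.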
Data Citation – Recommendations
B) Persistently Identify Specific Data sets (2/2)
When a data set should be persisted:
 R8 – Query PID: Assign a new PID to the query if either the
query is new or if the result set returned from an earlier identical
query is different due to changes in the data. Otherwise, return
the existing PID
 R9 – Store Query: Store query and metadata (e.g. PID, original
and normalized query, query & result set checksum, timestamp,
superset PID, data set description and other) in the query store
 R10 – Citation Text: Provide citation text including the PID in the
format prevalent in the designated community to lower barrier for
citing data.
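The R8 decision logic (a new PID only for a new query, or for an identical query whose result set has changed) might look like the following sketch; the PID format and store layout are invented:

```python
import itertools

query_store = {}                  # R9: PID -> stored query + metadata
_pid_counter = itertools.count(1)

def cite(normalized_query, result_hash, timestamp):
    """R8: assign a new PID only if the query is new, or if an identical query
    now yields a different result set; otherwise return the existing PID."""
    for pid, rec in query_store.items():
        if rec["query"] == normalized_query and rec["result_hash"] == result_hash:
            return pid                       # same query, same data: reuse the PID
    pid = f"pid-{next(_pid_counter):04d}"
    query_store[pid] = {"query": normalized_query,
                        "result_hash": result_hash,
                        "timestamp": timestamp}
    return pid

def citation_text(pid):
    """R10: citation snippet in the community's prevalent format (BibTeX here)."""
    return f"@misc{{{pid}, note = {{Data subset, PID {pid}}}}}"

p1 = cite("select * from t", "hash-a", 100.0)
p2 = cite("select * from t", "hash-a", 150.0)   # unchanged data: same PID
p3 = cite("select * from t", "hash-b", 200.0)   # data changed: new PID
```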
Data Citation – Recommendations
C) Resolving PIDs and Retrieving Data
 R11 – Landing Page: Make the PIDs resolve to a human
readable landing page that provides the data (via query re-
execution) and metadata, including a link to the superset
(PID of the data source) and citation text snippet
 R12 – Machine Actionability: Provide an API / machine
actionable landing page to access metadata and data via query
re-execution
Data Citation – Recommendations
D) Upon Modifications to the Data Infrastructure
 R13 – Technology Migration: When data is migrated to a new
representation (e.g. new database system, a new schema or a
completely different technology), migrate also the queries and
associated checksums
 R14 – Migration Verification: Verify successful data and query
migration, ensuring that queries can be re-executed correctly
Benefits
 Retrieval of precise subset with low storage overhead
 Subset as cited or as it is now (including e.g. corrections)
 Query provides provenance information
 Query store supports analysis of data usage
 Checksums support verification
 Same principles applicable across all settings
- Small and large data
- Static and dynamic data
- Different data representations (RDBMS, CSV, XML, LOD, …)
 Would work also for more sophisticated/general
transformations on data beyond select/project
Outline
 Recap: Challenges addressed by the WG
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
WG Pilots
Name | Data Type | Status | Notes
Timbus | RDBMS | research, finished | Sensor data, pilot
XML-Reference | XML | research, finished | eXist-DB
DEXHELPP | CSV/RDBMS | research, running | Social security data
CSV-Reference | CSV/RDBMS | reference, running (β) | Reference implem.
GIT-Reference | <ASCII> | reference, running (α) | Reference implem.
VAMDC | SQL/NoSQL/ASCII -> XML | deployment, running | Distributed data center
CBMI@wustl | RDBMS | deployment, starting | Integration into i2b2
CCCA | NetCDF | deployment, starting | Climate data
ENVRIplus | | deployment, starting | ICOS: Carbon Obs. Infr.
ARGO | NetCDF | deployment, starting | ODIP-II, RDA-Europe
BCO-DMO | CSV | deployment, starting | RDA-US
VMC (Vermont) | VMC data cat. | deployment, starting | Forest Research Data
<a few others> | CSV, RDBMS | deployment, planned | Conceptual evaluation, seeking funding
First Pilots for SQL Data
Stefan Pröll, SBA Research
sproell@sba-research.org
SQL Prototype Implementation
 LNEC Laboratory of Civil Engineering, Portugal
 Monitoring dams and bridges
 31 manual sensor instruments
 25 automatic sensor instruments
 Web portal
- Select sensor data
- Define timespans
 Report generation
- Analysis processes
- LaTeX
- publish PDF report
Florian Fuchs [CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
SQL Time-Stamping and Versioning
 Integrated
- Extend original tables by temporal metadata
- Expand primary key by record-version column
 Hybrid
- Utilize history table for deleted record versions with metadata
- Original table reflects latest version only
 Separated
- Utilizes full history table
- Also inserts reflected in history table
 Solution to be adopted depends on trade-off
- Storage Demand
- Query Complexity
- Software adaptation
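The hybrid approach can be sketched with SQLite triggers (schema and trigger names are illustrative): the live table keeps only the latest version, while updates and deletes archive the old version into a history table with metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hybrid versioning: the original table holds only the latest version;
    -- superseded or deleted versions move to a history table with metadata.
    CREATE TABLE readings (sensor_id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE readings_history (
        sensor_id   INTEGER,
        value       REAL,
        archived_at REAL DEFAULT (julianday('now')),
        operation   TEXT);

    CREATE TRIGGER readings_upd BEFORE UPDATE ON readings
    BEGIN
        INSERT INTO readings_history (sensor_id, value, operation)
        VALUES (OLD.sensor_id, OLD.value, 'update');
    END;

    CREATE TRIGGER readings_del BEFORE DELETE ON readings
    BEGIN
        INSERT INTO readings_history (sensor_id, value, operation)
        VALUES (OLD.sensor_id, OLD.value, 'delete');
    END;
""")
conn.execute("INSERT INTO readings VALUES (1, 10.5)")
conn.execute("UPDATE readings SET value = 10.7 WHERE sensor_id = 1")
conn.execute("DELETE FROM readings WHERE sensor_id = 1")
```

The trade-off named above is visible here: the live table stays small and fast, but any as-of query must reconstruct past states by combining the live and history tables.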
SQL: Storing Queries
 Add query store containing
- PID of the query
- Original query
- Re-written query + query string hash
- Timestamp
(as used in re-written query)
- Hash-key of query result
- Metadata useful for citation /
landing page
(creator, institution, rights, …)
- PID of parent dataset
(or using fragment identifiers for query)
SQL Query Re-Writing
 Adapt query to history table
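A toy illustration of the re-write, assuming a "separated" full-history table with hypothetical `valid_from`/`valid_to` columns. Real implementations rewrite the parsed query rather than doing string substitution:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings_history (
        sensor_id INTEGER, version INTEGER, value REAL,
        valid_from REAL, valid_to REAL);
    INSERT INTO readings_history VALUES (1, 1, 10.5, 100.0, 200.0);
    INSERT INTO readings_history VALUES (1, 2, 10.7, 200.0, NULL);
""")

def rewrite(user_query, ts):
    """Map a query on the live table onto the history table and slice it at ts
    (a toy string rewrite for illustration only)."""
    return (user_query.replace("FROM readings", "FROM readings_history")
            + f" AND valid_from <= {ts}"
            + f" AND (valid_to IS NULL OR valid_to > {ts})"
            + " ORDER BY sensor_id, version")   # R5: stable, unambiguous sort

q = "SELECT value FROM readings WHERE sensor_id = 1"
as_of_150 = conn.execute(rewrite(q, 150.0)).fetchall()   # the data as cited
as_of_250 = conn.execute(rewrite(q, 250.0)).fetchall()   # the data as it is now
```

The same stored query thus serves both options offered on the landing page: retrieving the original subset or the current version.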
Reference Implementation for
CSV Data
Stefan Pröll
sproell@sba-research.org
CSV Prototype: Basic Steps
 Upload interface for CSV files
 2 approaches:
• Migrate CSV file into RDBMS
 Generate table structure, identify primary key
 Add metadata columns for versioning, indices
• Use GIT for data and separate branch for queries
 Dynamic data
 Update / delete existing records
 Append new data
 Access interface
 Track subset creation
 Store queries
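The first approach, migrating a CSV file into an RDBMS with added versioning metadata, might be sketched as follows (table name, column handling, and the choice of the first column as key are simplifications for the example):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")

def ingest(csv_text, ts):
    """Migrate a CSV file into a table extended with versioning metadata:
    a record-version column and a validity interval per record."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f"{c} TEXT" for c in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS data ({cols}, "
                 "record_version INTEGER, valid_from REAL, valid_to REAL)")
    for row in data:
        conn.execute(
            f"INSERT INTO data VALUES ({','.join('?' * len(header))}, 1, ?, NULL)",
            (*row, ts))

ingest("id,temp\n1,10.5\n2,20.0\n", ts=100.0)
```

Updates and appends then follow the same pattern as for native SQL data: close the validity interval of the old record version and insert a new one.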
CSV Data Prototype
CSV Data Prototype
CSV Data Prototype
Progress on Data Citation within
VAMDC
C.M. Zwölf and VAMDC Consortium
carlo-maria.zwolf@obspm.fr
VAMDC: single and unique access to heterogeneous A+M databases,
serving plasma sciences, lighting technologies, atmospheric physics,
environmental sciences, fusion technologies, health and clinical
sciences, and astrophysics.
 Federates 28 heterogeneous
databases
http://portal.vamdc.org/
 Distributed infrastructure with no
central management system
 The “V” of VAMDC stands for Virtual:
the e-infrastructure does not itself
contain data; it is a wrapping layer
that exposes a set of heterogeneous
databases in a unified way
 Relies on a strong and sustainable
technical and political organisation
Virtual Atomic and Molecular Data Centre
VAMDC architecture: each existing, independent A+M database is
wrapped by a VAMDC layer, making it a VAMDC Node registered in the
VAMDC Registry. Queries are submitted using a standard vocabulary;
results are provided formatted into a standard XML file (XSAMS).
Clients (Portal, SpecView, SpectCol) ask the registry for available
resources, dispatch a unique A+M query to all registered nodes, and
receive a set of XSAMS files.
VAMDC Infrastructure
Architecture of the query store: a Central Log Service, plus a web
service that takes a query ID and returns the associated results.
VAMDC – proposed API for the query store:
• A web service that takes a date and a query, and returns a result
identical to the one that would be obtained by submitting the query
on the provided date
• A web service that takes a query ID, and returns the query and the
associated timestamp
• A web service that takes a query and a date, and returns the
associated query ID
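These three query-store services could be mocked as follows; the function names, ID scheme, and in-memory store are illustrative stand-ins, not the actual VAMDC API:

```python
import hashlib

_store = {}   # query_id -> (query, timestamp); stand-in for the Central Log Service

def query_id_for(query, date):
    """Takes a query and a date; returns the associated query ID."""
    qid = hashlib.sha256(f"{query}@{date}".encode()).hexdigest()[:12]
    _store.setdefault(qid, (query, date))
    return qid

def query_for(query_id):
    """Takes a query ID; returns the query and the associated timestamp."""
    return _store[query_id]

def execute_as_of(query, date):
    """Takes a date and a query; returns the result as it would have been
    obtained on that date (a placeholder re-execution here)."""
    return f"result of {query!r} at {date}"

qid = query_id_for("SELECT * FROM lines", "2016-05-01")
```

Resolving a citation then chains the services: look up the query and timestamp from the ID, and re-execute as of that timestamp.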
Versioning on Databases
WG Data Citation Pilot
CBMI @ WUSTL
Leslie McIntosh, Cynthia Hudson Vitale,
Snehil Gupta
Washington University in St. Louis
▪ Center for Biomedical Informatics,
Washington University in St. Louis
▪ Electronic medical health record aggregator i2b2
(Informatics for Integrating Biology and the Bedside)
NIH-funded, open-source software used by health-care systems
▪ Electronic patient medical records (EMR)
▪ i2b2 instance with de-identified data from local hospitals
and outpatient clinics
▪ Overall approx. 2 billion records
▪ 4 million patients, 48 million encounters, 82 million medications,
674 million lab results, 385 million vital-sign readings, …
▪ Obtained funding to implement WGDC recommendations
▪ Timeframe: 9 months
CBMI @ WUSTL
CBMI @ WUSTL
Outline
 Recap: Challenges addressed by the WG
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
Benefits
 Retrieval of precise subset with low storage overhead
 Subset as cited or as it is now (including e.g. corrections)
 Query provides provenance information
 Query store supports analysis of data usage
 Checksums support verification
 Same principles applicable across all settings
- Small and large data
- Static and dynamic data
- Different data representations (RDBMS, CSV, XML, LOD, …)
 Would work also for more sophisticated/general
transformations on data beyond select/project
Join RDA and Working Group WGDC
If you are interested in joining the discussion, contributing a
pilot, wish to establish a data citation solution, …
 Register for the RDA WG on Data Citation:
- Website:
https://rd-alliance.org/working-groups/data-citation-wg.html
- Mailinglist:
https://rd-alliance.org/node/141/archive-post-mailinglist
 Contact us if you plan to implement the recommendations
 Let us know your feedback, concerns, issues identified, …
Thank you!
https://rd-alliance.org/working-groups/data-citation-wg.html
[Architecture sketch: Data (Table A, Table B) → Query → Query Store;
Subsets; PID Provider → PID Store]
Data Citation – Recommendations
 2-page flyer,
more extensive doc to follow
 14 Recommendations
 Grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific
data sets
- Upon request of a PID
- Upon modifications to the data
infrastructure
 History
- First presented March 30 2015
- Major revision after workshop
April 20/21
- 4 workshops & presentations
- 2 webinars (June 9, June 24)
Data Citation for ENVRIplus
Ari Asmi
ari.asmi@helsinki.fi
ENVRI Plus – ICOS Data Citation
- Part of ENVRI PLUS data citation Workpackage
- ICOS – Integrated Carbon Observation System
(infrastructure)
Atmosphere, Ecosystems, Oceans:
• Distributed data production
• Distributed data storage
• Centralized “high level” data sets, updated daily
• Wide usage in carbon observation science
• Some NRT (near real time) data
• Some “high level” data storage
Versioning DB: OK
Distributed data delivery: potential issue
(if users bypass the web interface)
ENVRIplus ICOS Implementation (in progress)
Data Citation for ARGO
(ODIP II Project)
Helen Glaves
hmg@bgs.ac.uk
 Aims & objectives
- Resolve the ambiguity in the syntax for citation of dynamic
data
- Agree and ratify a common syntax for dynamic data citation
- Publish results in authoritative documentation e.g. DataCite
metadata schema
- Implement dynamic data citation for Argo data
Argo data use case
 Argo data held by several international data centres
- IFREMER
- NCEI (formerly the NOAA National Climatic Data Center,
the National Geophysical Data Center, and
the National Oceanographic Data Center)
- BODC
 Validation of method using a real world exemplar
 Results reported to RDA via DCWG and MDH IG
 Feed into related activities in ODIP, ENVRIplus, EUDAT etc.
Application scenario
Adoption of Data Citation Outcomes
by BCO-DMO, R2R
Cynthia Chandler, Adam Shepherd
 BCO-DMO
- Biological and Chemical Oceanography Data
Management Office (WHOI)
- Curation of marine ecosystem data
contributed by NSF-funded investigators
 R2R
- Rolling Deck to Repository
- Curation of routine, underway data from US
academic fleet, and authoritative expedition catalog
 Members of Marine Data Harmonization IG
US Ocean Science Domain Repositories
BCO-DMO Adoption of Data Citation Outputs
- Evaluation
– Evaluate recommendations
– Try implementation in existing
systems
- Trial
– BCO-DMO: R1-11 fit well with
current architecture; R12 doable;
test as part of DataONE node
membership
– R2R: curation of original field data
and selected subset of post-field
products (ship track); so no evolving
data
CARIACO zooplankton data subset, since 2000
/OCB/CARIACO/Zooplankton.html0?Date>20000101,
Cruise_ID,lon,lat,Date,zoop_DW_200,zoop_ash_200,
zoop_DW_500,zoop_ash_500
 Preserve the data subset
 Request a DOI
 Store data subset, query, and create new landing page
for data subset DOI
BCO-DMO - New capabilities

More Related Content

PDF
How can we ensure research data is re-usable? The role of Publishers in Resea...
PPTX
Authority files - Jisc Digital Festival 2014
PPTX
Research data management workshop april12 2016
PPTX
THOR Workshop - Data Publishing PLOS
PPTX
THOR Workshop - Introduction
PPTX
Why does research data matter to libraries
PPTX
Open Science (publishing) as-a-Service (Presentation by Paolo Manghi at the ...
PPTX
THOR Workshop - Data Publishing Elsevier
How can we ensure research data is re-usable? The role of Publishers in Resea...
Authority files - Jisc Digital Festival 2014
Research data management workshop april12 2016
THOR Workshop - Data Publishing PLOS
THOR Workshop - Introduction
Why does research data matter to libraries
Open Science (publishing) as-a-Service (Presentation by Paolo Manghi at the ...
THOR Workshop - Data Publishing Elsevier

What's hot (20)

PPTX
OpenAIRE: eInfrastructure for Open Science
PPTX
The Challenges of Making Data Travel, by Sabina Leonelli
PPTX
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
PPTX
THOR Workshop - Data Publishing
PDF
Open Science: Research Data Management
PPTX
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
PPTX
Data management: The new frontier for libraries
PPTX
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
PDF
Research Data Management and the brave new world, By Paul Ayris
PPTX
Burton - Security, Privacy and Trust
PDF
Data Publishing Models by Sünje Dallmeier-Tiessen
PPTX
Reproducibility (and the R*) of Science: motivations, challenges and trends
PDF
Dataverse in the Universe of Data by Christine L. Borgman
PPTX
The FAIRDOM Commons for Systems Biology
PPT
David Shotton - Research Integrity: Integrity of the published record
PDF
Levine - Data Curation; Ethics and Legal Considerations
PPTX
Stop press: should embargo conditions apply to metadata?
PPT
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
PPTX
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
PPTX
The Needs of stakeholders in the RDM process - the role of LEARN. By Paul Ayr...
OpenAIRE: eInfrastructure for Open Science
The Challenges of Making Data Travel, by Sabina Leonelli
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
THOR Workshop - Data Publishing
Open Science: Research Data Management
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
Data management: The new frontier for libraries
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
Research Data Management and the brave new world, By Paul Ayris
Burton - Security, Privacy and Trust
Data Publishing Models by Sünje Dallmeier-Tiessen
Reproducibility (and the R*) of Science: motivations, challenges and trends
Dataverse in the Universe of Data by Christine L. Borgman
The FAIRDOM Commons for Systems Biology
David Shotton - Research Integrity: Integrity of the published record
Levine - Data Curation; Ethics and Legal Considerations
Stop press: should embargo conditions apply to metadata?
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
The Needs of stakeholders in the RDM process - the role of LEARN. By Paul Ayr...
Ad

Similar to Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group, by Andreas Rauber (20)

PPT
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
PPT
Labmatrix
PDF
PPTX
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
PDF
Data Infrastructure for a World of Music
PPTX
Is the traditional data warehouse dead?
PPTX
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
PPTX
JOSA TechTalk: Metadata Management
in Big Data
PDF
Hoodie - DataEngConf 2017
PPT
Dw Concepts
PPTX
Odam: Open Data, Access and Mining
PDF
HEPData Open Repositories 2016 Talk
PPTX
Tableau and hadoop
PPT
LECTURE4.ppt
PPT
2004-11-13 Supersite Relational Database Project: (Data Portal?)
PPT
Srds Pres011120
PPTX
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
PPT
DataFinder concepts and example: General (20100503)
PPTX
Cloud computing major project
PDF
Hpdw 2015-v10-paper
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Labmatrix
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Data Infrastructure for a World of Music
Is the traditional data warehouse dead?
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
JOSA TechTalk: Metadata Management
in Big Data
Hoodie - DataEngConf 2017
Dw Concepts
Odam: Open Data, Access and Mining
HEPData Open Repositories 2016 Talk
Tableau and hadoop
LECTURE4.ppt
2004-11-13 Supersite Relational Database Project: (Data Portal?)
Srds Pres011120
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
DataFinder concepts and example: General (20100503)
Cloud computing major project
Hpdw 2015-v10-paper
Ad

More from LEARN Project (20)

PDF
Research Data Management, Challenges and Tools - Per Öster
PPTX
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
PDF
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
PDF
Data, Science, Society - Claudio Gutierrez, University of Chile
PPTX
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
PPTX
LEARN Final Conference: Tutorial Group | Costing RDM
PPT
Paolo Budroni at COAR Annual Meeting
PDF
LEARN Webinar
PDF
Developing a Framework for Research Data Management Protocols
PDF
The Needs of Stakeholders in the RDM Process - the role of LEARN
PDF
Opening Research Data in EU Universities: Policies, Motivators and Challenges
PDF
About Data From A Machine Learning Perspective
PDF
LEARN Carribean Workshop Opening Remarks
PDF
Managing Research Data in the Caribbean: Good practices and challenges
PDF
LEARN Project: The Story So Far
PDF
The Data Deluge: the Role of Research Organisations
PDF
Data for Development in the Caribbean
PDF
Open Data in a Big World by Fernando Ariel López
PDF
CENTRO DE DATOS
PDF
Research Data Management in São Paulo by Fabio Kon FAPESP
Research Data Management, Challenges and Tools - Per Öster
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
Data, Science, Society - Claudio Gutierrez, University of Chile
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
LEARN Final Conference: Tutorial Group | Costing RDM
Paolo Budroni at COAR Annual Meeting
LEARN Webinar
Developing a Framework for Research Data Management Protocols
The Needs of Stakeholders in the RDM Process - the role of LEARN
Opening Research Data in EU Universities: Policies, Motivators and Challenges
About Data From A Machine Learning Perspective
LEARN Carribean Workshop Opening Remarks
Managing Research Data in the Caribbean: Good practices and challenges
LEARN Project: The Story So Far
The Data Deluge: the Role of Research Organisations
Data for Development in the Caribbean
Open Data in a Big World by Fernando Ariel López
CENTRO DE DATOS
Research Data Management in São Paulo by Fabio Kon FAPESP

Recently uploaded (20)

PDF
Launch Your Data Science Career in Kochi – 2025
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Computer network topology notes for revision
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
Introduction to Business Data Analytics.
PDF
Lecture1 pattern recognition............
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
IB Computer Science - Internal Assessment.pptx
PPT
Quality review (1)_presentation of this 21
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Launch Your Data Science Career in Kochi – 2025
Supervised vs unsupervised machine learning algorithms
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
STUDY DESIGN details- Lt Col Maksud (21).pptx
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
IBA_Chapter_11_Slides_Final_Accessible.pptx
Computer network topology notes for revision
.pdf is not working space design for the following data for the following dat...
Business Ppt On Nestle.pptx huunnnhhgfvu
oil_refinery_comprehensive_20250804084928 (1).pptx
Introduction to Business Data Analytics.
Lecture1 pattern recognition............
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Clinical guidelines as a resource for EBP(1).pdf
IB Computer Science - Internal Assessment.pptx
Quality review (1)_presentation of this 21
Business Acumen Training GuidePresentation.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx

Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group, by Andreas Rauber

  • 1. Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group WGDC Andreas Rauber Technical University of Vienna rauber@ifs.tuwien.ac.at http://www.ifs.tuwien.ac.at/~andi
  • 2. Outline  Challenges in Data Identification and Citation  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 3. Data Citation
    Why do we want to precisely identify data and cite it?
     Citation to give and accumulate credit for the
    - Creator of the dataset
    - Datacenter hosting the data
    - Funder providing funding
    - …
     Citation to assist in research
    - Reproduce a study
    - Compare different models
    - Automate meta-studies
    - …
  • 4. Data Citation
     Citing data may seem easy
    - from providing a URL in a footnote
    - via providing a reference in the bibliography section
    - to assigning a PID (DOI, ARK, …) to a dataset in a repository
     What’s the problem?
    Page 4
  • 5. Granularity of Data Identification / Citation
     What about the granularity of data to be identified/cited?
    - Databases collect enormous amounts of data over time
    - Researchers use specific subsets of data
    - Need to identify precisely the subset used
     Current approaches
    - Citing entire dataset, providing textual description of subset -> imprecise (ambiguity)
    - Storing a copy of subset as used in study -> scalability
    - Storing list of record identifiers in subset -> scalability, not for arbitrary subsets (e.g. when not entire record selected)
     Would like to be able to identify & cite precisely the subset of (dynamic) data used in a study
    Page 5
  • 6. Citation of Dynamic Data
     Citable datasets have to be static
    - Fixed set of data, no changes: no corrections to errors, no new data being added
     But: (research) data is dynamic
    - Adding new data, correcting errors, enhancing data quality, …
    - Changes sometimes highly dynamic, at irregular intervals
     Current approaches
    - Identifying entire data stream, without any versioning
    - Using “accessed at” date
    - “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!)
     Would like to cite precisely the data as it existed at a certain point in time, without delaying release of new data
    Page 6
  • 7. Data Citation – Requirements
     Dynamic data
    - corrections, additions, …
     Arbitrary subsets of data (granularity)
    - rows/columns, time sequences, …
    - from single number to the entire set
     Stable across technology changes
    - e.g. migration to new database
     Machine-actionable
    - not just machine-readable, definitely not just human-readable and interpretable
     Scalable to very large / highly dynamic datasets
    - But: should also work for small and/or static datasets!
  • 8. RDA WG Data Citation
     Research Data Alliance
     WG on Data Citation: Making Dynamic Data Citeable
     March 2014 – Sep 2015
    - Concentrating on the problems of large, dynamic (changing) datasets
    - Focus: identification of data. Not: PID systems, metadata, citation string, attribution, …
    - Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …)
    - Continuing support for adoption
    https://rd-alliance.org/working-groups/data-citation-wg.html
  • 9. Data Citation – Output
     14 Recommendations grouped into 4 phases:
    - Preparing data and query store
    - Persistently identifying specific data sets
    - Resolving PIDs
    - Upon modifications to the data infrastructure
     2-page flyer
     Technical Report: draft at https://rd-alliance.org/system/files/documents/RDA-Guidelines_TCDL_draft.pdf
     Reference implementations (SQL, CSV, XML) and Pilots
  • 10. Outline  Recap: Challenges addressed by the WG  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 11. Making Dynamic Data Citeable
    Data Citation: Data + Means-of-access
     Data  time-stamped & versioned (aka history)
    Researcher creates working-set via some interface:
     Access  assign PID to QUERY, enhanced with
    - Time-stamping for re-execution against versioned DB
    - Re-writing for normalization, unique-sort, mapping to history
    - Hashing result-set: verifying identity/correctness
    leading to landing page
    S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013. http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf
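The access steps on this slide (time-stamping, query normalization, result-set hashing, PID assignment) can be sketched as follows. This is an illustrative Python sketch, not the WG's reference implementation; the normalization rule, field names, and use of a UUID as a PID stand-in are all hypothetical.

```python
import hashlib
import uuid
from datetime import datetime, timezone

def normalize(query: str) -> str:
    """Hypothetical query normalization: collapse whitespace, lowercase."""
    return " ".join(query.split()).lower()

def cite_query(query: str, result_rows):
    """Sketch of the access steps: normalize, timestamp, hash, assign PID."""
    normalized = normalize(query)
    query_hash = hashlib.sha256(normalized.encode()).hexdigest()
    # Unique, stable sort of the result set before hashing,
    # so re-execution yields the same fixity information
    sorted_rows = sorted(result_rows)
    result_hash = hashlib.sha256(repr(sorted_rows).encode()).hexdigest()
    return {
        "pid": str(uuid.uuid4()),          # stand-in for a DOI/ARK
        "query": query,
        "normalized_query": normalized,
        "query_hash": query_hash,
        "result_hash": result_hash,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = cite_query("SELECT  station, value FROM sensors  WHERE value > 10",
                    [("s2", 12), ("s1", 17)])
```

The returned record corresponds to what a query store would persist; re-executing the stored, time-stamped query and comparing hashes verifies the citation.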
  • 12.–15. Data Citation – Deployment
     Researcher uses workbench to identify subset of data
     Upon executing selection („download“) user gets
    - Data (package, access API, …)
    - PID (e.g. DOI) (query is time-stamped and stored)
    - Hash value computed over the data for local storage
    - Recommended citation text (e.g. BibTeX)
     PID resolves to landing page
    - Provides detailed metadata, link to parent data set, subset, …
    - Option to retrieve original data OR current version OR changes
     Upon activating PID associated with a data citation
    - Query is re-executed against time-stamped and versioned DB
    - Results as above are returned
     Query store aggregates data usage
    Note: the query string provides excellent provenance information on the data set! This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump. It identifies which parts of the data are used; if data changes, the affected queries (studies) can be identified.
  • 16. Data Citation – Recommendations
    Preparing Data & Query Store
    - R1 – Data Versioning
    - R2 – Timestamping
    - R3 – Query Store
    When Data Should Be Persisted
    - R4 – Query Uniqueness
    - R5 – Stable Sorting
    - R6 – Result Set Verification
    - R7 – Query Timestamping
    - R8 – Query PID
    - R9 – Store Query
    - R10 – Citation Text
    When Resolving a PID
    - R11 – Landing Page
    - R12 – Machine Actionability
    Upon Modifications to the Data Infrastructure
    - R13 – Technology Migration
    - R14 – Migration Verification
  • 17. Data Citation – Recommendations
    A) Preparing the Data and the Query Store
     R1 – Data Versioning: Apply versioning to ensure that earlier states of the data can be retrieved
     R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions and deletions are marked with a timestamp
     R3 – Query Store: Provide means to store the queries and the metadata needed to re-execute them in the future
  • 18. Data Citation – Recommendations
    B) Persistently Identify Specific Data Sets (1/2)
    When a data set should be persisted:
     R4 – Query Uniqueness: Re-write the query to a normalized form so that identical queries can be detected. Compute a checksum of the normalized query to efficiently detect identical queries
     R5 – Stable Sorting: Ensure an unambiguous sorting of the records in the data set
     R6 – Result Set Verification: Compute fixity information/checksum of the query result set to enable verification of the correctness of a result upon re-execution
     R7 – Query Timestamping: Assign a timestamp to the query based on the last update to the entire database (or the last update to the selection of data affected by the query, or the query execution time). This allows retrieving the data as it existed at query time
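R5 and R6 interact: fixity information is only reproducible if the result set is serialized in an unambiguous order. A minimal Python sketch of this property, assuming hypothetical row tuples and a simple CSV-style serialization:

```python
import hashlib

def result_checksum(rows):
    """R5/R6 sketch: sort rows unambiguously before computing fixity
    information, so logically identical result sets verify as equal."""
    canonical = "\n".join(",".join(map(str, r)) for r in sorted(rows))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = [("s1", 17), ("s2", 12)]
b = [("s2", 12), ("s1", 17)]   # same set, different retrieval order
assert result_checksum(a) == result_checksum(b)
```

Without the stable sort, the two checksums would differ and a correctly re-executed query would falsely fail verification.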
  • 19. Data Citation – Recommendations
    B) Persistently Identify Specific Data Sets (2/2)
    When a data set should be persisted:
     R8 – Query PID: Assign a new PID to the query if either the query is new or the result set returned from an earlier identical query is different due to changes in the data. Otherwise, return the existing PID
     R9 – Store Query: Store the query and metadata (e.g. PID, original and normalized query, query & result set checksums, timestamp, superset PID, data set description, and others) in the query store
     R10 – Citation Text: Provide a citation text including the PID in the format prevalent in the designated community to lower the barrier for citing data
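R8's decision logic (mint a new PID only for a new query or a changed result set) can be sketched as follows; the in-memory dictionary and the `pid-N` format are hypothetical stand-ins for a real query store and a DOI minting service.

```python
query_store = {}  # (query checksum, result-set checksum) -> PID

def assign_pid(query_hash: str, result_hash: str, mint_pid) -> str:
    """R8 sketch: return the existing PID for an identical query over
    unchanged data; otherwise mint and store a new one."""
    key = (query_hash, result_hash)
    if key not in query_store:
        query_store[key] = mint_pid()
    return query_store[key]

counter = iter(range(10**6))
mint = lambda: f"pid-{next(counter)}"

p1 = assign_pid("qh1", "rh1", mint)  # new query -> new PID
p2 = assign_pid("qh1", "rh1", mint)  # identical query, unchanged data -> same PID
p3 = assign_pid("qh1", "rh2", mint)  # same query, data changed -> new PID
```

Keying on both checksums is what makes a citation stable: the same subset keeps the same PID, while any change in the underlying data yields a distinguishable identifier.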
  • 20. Data Citation – Recommendations
    C) Resolving PIDs and Retrieving Data
     R11 – Landing Page: Make the PIDs resolve to a human-readable landing page that provides the data (via query re-execution) and metadata, including a link to the superset (PID of the data source) and a citation text snippet
     R12 – Machine Actionability: Provide an API / machine-actionable landing page to access metadata and data via query re-execution
  • 21. Data Citation – Recommendations
    D) Upon Modifications to the Data Infrastructure
     R13 – Technology Migration: When data is migrated to a new representation (e.g. a new database system, a new schema, or a completely different technology), migrate the queries and associated checksums as well
     R14 – Migration Verification: Verify successful data and query migration, ensuring that queries can be re-executed correctly
  • 23. Benefits
     Retrieval of precise subset with low storage overhead
     Subset as cited or as it is now (including e.g. corrections)
     Query provides provenance information
     Query store supports analysis of data usage
     Checksums support verification
     Same principles applicable across all settings
    - Small and large data
    - Static and dynamic data
    - Different data representations (RDBMS, CSV, XML, LOD, …)
     Would also work for more sophisticated/general transformations on data beyond select/project
  • 24. Outline  Recap: Challenges addressed by the WG  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 25. WG Pilots
    Name | Data Type | Status | Notes
    Timbus | RDBMS | research, finished | Sensor data, pilot
    XML-Reference | XML | research, finished | eXist-DB
    DEXHELPP | CSV/RDBMS | research, running | Social security data
    CSV-Reference | CSV/RDBMS | reference, running (β) | Reference implem.
    GIT-Reference | <ASCII> | reference, running (α) | Reference implem.
    VAMDC | SQL/NoSQL/ASCII -> XML | deployment, running | Distributed data center
    CBMI@wustl | RDBMS | deployment, starting | Integration into i2b2
    CCCA | NetCDF | deployment, starting | Climate data
    ENVRIplus | – | deployment, starting | ICOS: Carbon Obs. Infr.
    ARGO | NetCDF | deployment, starting | ODIP-II, RDA-Europe
    BCO-DMO | CSV | deployment, starting | RDA-US
    VMC (Vermont) | VMC data cat. | deployment, starting | Forest Research Data
    <a few others> | CSV, RDBMS | deployment, planned | Conceptual evaluation, seeking funding
  • 26. First Pilots for SQL Data Stefan Pröll, SBA Research sproell@sba-research.org
  • 27. SQL Prototype Implementation
     LNEC Laboratory of Civil Engineering, Portugal
     Monitoring dams and bridges
    - 31 manual sensor instruments
    - 25 automatic sensor instruments
     Web portal
    - Select sensor data
    - Define timespans
     Report generation
    - Analysis processes
    - LaTeX - publish PDF report
    Page 28
    Florian Fuchs [CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
  • 28. SQL Time-Stamping and Versioning
     Integrated
    - Extend original tables by temporal metadata
    - Expand primary key by record-version column
     Hybrid
    - Utilize history table for deleted record versions with metadata
    - Original table reflects latest version only
     Separated
    - Utilizes full history table
    - Also inserts reflected in history table
     Solution to be adopted depends on trade-off
    - Storage demand
    - Query complexity
    - Software adaptation
    Page 30
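As a sketch, the "integrated" approach might look as follows, shown here with SQLite: the original table is extended with temporal metadata and the primary key is expanded by a record-version column. All table and column names are hypothetical, and the error handling a production system would need is omitted.

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE sensor_data (
    record_id INTEGER,
    record_version INTEGER,
    value REAL,
    valid_from TEXT,
    valid_to TEXT,             -- NULL while this version is current
    PRIMARY KEY (record_id, record_version))""")

def upsert(record_id, value):
    """Insert a new record version, closing the previous one
    instead of overwriting it."""
    now = datetime.now(timezone.utc).isoformat()
    cur = db.execute(
        "SELECT MAX(record_version) FROM sensor_data WHERE record_id = ?",
        (record_id,))
    latest = cur.fetchone()[0]
    if latest is not None:
        db.execute("""UPDATE sensor_data SET valid_to = ?
                      WHERE record_id = ? AND record_version = ?""",
                   (now, record_id, latest))
    db.execute("INSERT INTO sensor_data VALUES (?, ?, ?, ?, NULL)",
               (record_id, (latest or 0) + 1, value, now))

upsert(1, 10.0)
upsert(1, 12.5)   # correction: version 1 is closed, version 2 is current
```

Because no version is ever deleted, any earlier state of the table can be reconstructed by filtering on `valid_from`/`valid_to`, which is exactly what the re-written citation queries rely on.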
  • 29. SQL: Storing Queries
     Add query store containing
    - PID of the query
    - Original query
    - Re-written query + query string hash
    - Timestamp (as used in re-written query)
    - Hash key of query result
    - Metadata useful for citation / landing page (creator, institution, rights, …)
    - PID of parent dataset (or using fragment identifiers for query)
    Page 31
  • 30. SQL Query Re-Writing
     Adapt query to the history table
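A minimal sketch of such a re-write, assuming the hypothetical `valid_from`/`valid_to` columns from the integrated versioning scheme: the stored query is extended with a temporal predicate that restricts it to the record versions valid at the citation timestamp. Real implementations would rewrite the parsed query rather than the raw string.

```python
def rewrite_for_history(query: str, timestamp: str) -> str:
    """Append the temporal predicate that restricts a query to the
    record versions valid at the given (citation) timestamp."""
    predicate = (f"valid_from <= '{timestamp}' "
                 f"AND (valid_to IS NULL OR valid_to > '{timestamp}')")
    if " where " in query.lower():
        return f"{query} AND {predicate}"
    return f"{query} WHERE {predicate}"

q = rewrite_for_history("SELECT value FROM sensor_data WHERE station = 's1'",
                        "2015-09-01T00:00:00Z")
```

Re-executing the re-written query at any later date returns the data exactly as it existed at the stored timestamp.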
  • 31. Reference Implementation for CSV Data Stefan Pröll sproell@sba-research.org
  • 32. CSV Prototype: Basic Steps
     Upload interface for CSV files
     2 approaches:
    • Migrate CSV file into RDBMS
    - Generate table structure, identify primary key
    - Add metadata columns for versioning, indices
    • Use GIT for data and separate branch for queries
     Dynamic data
    - Update / delete existing records
    - Append new data
     Access interface
    - Track subset creation
    - Store queries
    Barrymieny
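The first approach (migrating the CSV file into an RDBMS with added versioning metadata) can be sketched as follows; the table name, the TEXT-only column typing, and the `inserted_at`/`deleted_at` columns are simplifying assumptions, not the prototype's actual schema.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

csv_data = "station,value\ns1,17\ns2,12\n"   # stand-in for an uploaded file

reader = csv.reader(io.StringIO(csv_data))
header = next(reader)

# Generate the table structure from the CSV header and
# add metadata columns for versioning
db = sqlite3.connect(":memory:")
cols = ", ".join(f"{c} TEXT" for c in header)
db.execute(f"CREATE TABLE csv_table ({cols}, inserted_at TEXT, deleted_at TEXT)")

now = datetime.now(timezone.utc).isoformat()
placeholders = ", ".join("?" for _ in header)
for row in reader:
    db.execute(f"INSERT INTO csv_table VALUES ({placeholders}, ?, NULL)",
               (*row, now))
```

Updates and deletes then set `deleted_at` on the old row (and insert a replacement), so subsets can be re-created as of any timestamp, just as in the SQL pilot.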
  • 36. Progress on Data Citation within VAMDC C.M. Zwölf and VAMDC Consortium carlo-maria.zwolf@obspm.fr
  • 37. Virtual Atomic and Molecular Data Centre
    Plasma sciences, lighting technologies, atmospheric physics, environmental sciences, fusion technologies, health and clinical sciences, astrophysics
    VAMDC: single and unique access to heterogeneous A+M databases
     Federates 28 heterogeneous databases: http://portal.vamdc.org/
     Distributed infrastructure with no central management system
     The “V” of VAMDC stands for Virtual in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapping for exposing in a unified way a set of heterogeneous databases.
     Relies on a strong and sustainable technical and political organisation.
  • 38. VAMDC Infrastructure
    [Architecture diagram] A VAMDC wrapping layer turns each existing, independent A+M database into a VAMDC Node: queries are submitted in a standard vocabulary, and results are provided formatted into standard XML files (XSAMS). Each node is registered as a resource in the VAMDC Registry. Clients (Portal, SpecView, SpectCol) ask the registry for available resources, dispatch a unique A+M query to all registered nodes, and collect a set of XSAMS files.
  • 39. Architecture of the query store
    [Architecture diagram] VAMDC proposed API for the query store, backed by a Central Log Service and versioning on the databases:
    - Web service: takes a query ID, returns the associated results
    - Web service: takes a date and a query, returns a result identical to the one that would be obtained by submitting the query on the provided date
    - Web service: takes a query ID, returns the query and the associated timestamp
    - Web service: takes a query and a date, returns the associated query ID
  • 40. WG Data Citation Pilot CBMI @ WUSTL Leslie McIntosh, Cynthia Hudson Vitale, Snehil Gupta Washington University in St. Louis
  • 41. CBMI @ WUSTL
    ▪ Center for Biomedical Informatics, Washington University in St. Louis
    ▪ Electronic medical health record aggregator i2b2 (Informatics for Integrating Biology and the Bedside), NIH-funded, open-source software used across the health care system
    ▪ Electronic patient medical records (EMR)
    ▪ i2b2 instance with de-identified data from local hospitals and outpatient clinics
    ▪ Overall approx. 2 billion records
    ▪ 4 million patients, 48 million encounters, 82 million medications, 674 million lab results, 385 million vital sign data, …
    ▪ Obtained funding to implement WGDC recommendations
    ▪ Timeframe: 9 months
  • 43. WG Pilots
    Name | Data Type | Status | Notes
    Timbus | RDBMS | research, finished | Sensor data, pilot
    XML-Reference | XML | research, finished | eXist-DB
    DEXHELPP | CSV/RDBMS | research, running | Social security data
    CSV-Reference | CSV/RDBMS | reference, running (β) | Reference implem.
    GIT-Reference | <ASCII> | reference, running (α) | Reference implem.
    VAMDC | SQL/NoSQL/ASCII -> XML | deployment, running | Distributed data center
    CBMI@wustl | RDBMS | deployment, starting | Integration into i2b2
    CCCA | NetCDF | deployment, starting | Climate data
    ENVRIplus | – | deployment, starting | ICOS: Carbon Obs. Infr.
    ARGO | NetCDF | deployment, starting | ODIP-II, RDA-Europe
    BCO-DMO | CSV | deployment, starting | RDA-US
    VMC (Vermont) | VMC data cat. | deployment, starting | Forest Research Data
    <a few others> | CSV, RDBMS | deployment, planned | Conceptual evaluation, seeking funding
  • 44. Outline  Recap: Challenges addressed by the WG  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 45. Benefits
     Retrieval of precise subset with low storage overhead
     Subset as cited or as it is now (including e.g. corrections)
     Query provides provenance information
     Query store supports analysis of data usage
     Checksums support verification
     Same principles applicable across all settings
    - Small and large data
    - Static and dynamic data
    - Different data representations (RDBMS, CSV, XML, LOD, …)
     Would also work for more sophisticated/general transformations on data beyond select/project
  • 46. Join RDA and Working Group WGDC If you are interested in joining the discussion, contributing a pilot, wish to establish a data citation solution, …  Register for the RDA WG on Data Citation: - Website: https://rd-alliance.org/working-groups/data-citation-wg.html - Mailinglist: https://rd-alliance.org/node/141/archive-post-mailinglist  Contact us if you plan to implement the recommendations  Let us know your feedback, concerns, issues identified, …
  • 49. Data Citation – Recommendations
     2-page flyer, more extensive doc to follow
     14 Recommendations
     Grouped into 4 phases:
    - Preparing data and query store
    - Persistently identifying specific data sets
    - Upon request of a PID
    - Upon modifications to the data infrastructure
     History
    - First presented March 30, 2015
    - Major revision after workshop April 20/21
    - 4 workshops & presentations
    - 2 webinars (June 9, June 24)
  • 50. Data Citation for ENVRIplus Ari Asmi ari.asmi@helsinki.fi
  • 51. ENVRIplus – ICOS Data Citation
    - Part of the ENVRIplus data citation work package
    - ICOS – Integrated Carbon Observation System (infrastructure): Atmosphere, Ecosystems, Oceans
    • Distributed data production
    • Distributed data storage
    • Centralized “high level” data sets
    • Updated daily
    • Wide usage in carbon observation science
    • Some NRT (near real-time)
    • Some “high level” data storage
  • 52. ENVRIplus ICOS Implementation (in progress)
    - Versioning DB – OK!
    - Distributed data delivery – potential issue (if users bypass the web interface)
  • 53. Data Citation for ARGO (ODIP II Project) Helen Glaves hmg@bgs.ac.uk
  • 54. Argo data use case
     Aims & objectives
    - Resolve the ambiguity in the syntax for citation of dynamic data
    - Agree and ratify a common syntax for dynamic data citation
    - Publish results in authoritative documentation, e.g. the DataCite metadata schema
    - Implement dynamic data citation for Argo data
  • 55. Application scenario
     Argo data held by several international data centres
    - IFREMER
    - NCEI (formerly the NOAA National Climatic Data Center, the National Geophysical Data Center, and the National Oceanographic Data Center)
    - BODC
     Validation of method using a real-world exemplar
     Results reported to RDA via DCWG and MDH IG
     Feed into related activities in ODIP, ENVRIplus, EUDAT, etc.
  • 56. Adoption of Data Citation Outcomes by BCO-DMO, R2R Cynthia Chandler, Adam Shepherd
  • 57. US Ocean Science Domain Repositories
     BCO-DMO
    - Biological and Chemical Oceanography Data Management Office (WHOI)
    - Curation of marine ecosystem data contributed by NSF-funded investigators
     R2R – Rolling Deck to Repository
    - Curation of routine, underway data from the US academic fleet, and authoritative expedition catalog
     Members of the Marine Data Harmonization IG
  • 58. BCO-DMO Adoption of Data Citation Outputs
    - Evaluation
    – Evaluate recommendations
    – Try implementation in existing systems
    - Trial
    – BCO-DMO: R1–R11 fit well with current architecture; R12 doable; test as part of DataONE node membership
    – R2R: curation of original field data and selected subset of post-field products (ship track); so no evolving data
  • 59. CARIACO zooplankton data subset, since 2000 /OCB/CARIACO/Zooplankton.html0?Date>20000101, Cruise_ID,lon,lat,Date,zoop_DW_200,zoop_ash_200, zoop_DW_500,zoop_ash_500
  • 60.  Preserve the data subset  Request a DOI  Store data subset, query, and create new landing page for data subset DOI BCO-DMO - New capabilities