Enabling Precise Identification and
Citability of Dynamic Data:
Recommendations of the RDA Working
Group WGDC
Andreas Rauber
Technical University of Vienna
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi
Outline
 Challenges in Data Identification and Citation
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
Data Citation
Why do we want to precisely identify
and cite data?
 Citation to give and accumulate credit for the
- Creator of the dataset
- Datacenter hosting the data
- Funder providing funding
- …
 Citation to assist in research
- Reproduce a study
- Compare different models
- Automate meta-studies
- …
Data Citation
 Citing data may seem easy
- from providing a URL in a footnote
- via providing a reference in the bibliography section
- to assigning a PID (DOI, ARK, …) to dataset in a repository
 What’s the problem?
Granularity of Data Identification / Citation
 What about the granularity of data to be identified/cited?
- Databases collect enormous amounts of data over time
- Researchers use specific subsets of data
- Need to identify precisely the subset used
 Current approaches
- Citing entire dataset, providing textual description of subset
-> imprecise (ambiguity)
- Storing a copy of subset as used in study -> scalability
- Storing list of record identifiers in subset -> scalability,
not for arbitrary subsets (e.g. when not entire record selected)
 Would like to be able to identify & cite precisely the
subset of (dynamic) data used in a study
Citation of Dynamic Data
 Citable datasets have to be static
- Fixed set of data, no changes:
no corrections to errors, no new data being added
 But: (research) data is dynamic
- Adding new data, correcting errors, enhancing data quality, …
- Changes sometimes highly dynamic, at irregular intervals
 Current approaches
- Identifying entire data stream, without any versioning
- Using “accessed at” date
- “Artificial” versioning by identifying batches of data (e.g.
annual), aggregating changes into releases (time-delayed!)
 Would like to cite precisely the data as it existed at certain
point in time, without delaying release of new data
Data Citation – Requirements
 Dynamic data
- corrections, additions, …
 Arbitrary subsets of data (granularity)
- rows/columns, time sequences, …
- from single number to the entire set
 Stable across technology changes
- e.g. migration to new database
 Machine-actionable
- not just machine-readable,
definitely not just human-readable and interpretable
 Scalable to very large / highly dynamic datasets
- But: should also work for small and/or static datasets!
 Research Data Alliance
 WG on Data Citation:
Making Dynamic Data Citeable
 March 2014 – Sep 2015
- Concentrating on the problems of
large, dynamic (changing) datasets
- Focus! Identification of data!
Not: PID systems, metadata, citation string, attribution, …
- Liaise with other WGs and initiatives on data citation
(CODATA, DataCite, Force11, …)
- Continuing support for adoption
https://rd-alliance.org/working-groups/data-citation-wg.html
RDA WG Data Citation
Data Citation – Output
 14 Recommendations
grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific data sets
- Resolving PIDs
- Upon modifications to the data
infrastructure
 2-page flyer
 Technical Report: draft at
https://rd-alliance.org/system/files/documents/
RDA-Guidelines_TCDL_draft.pdf
 Reference implementations
(SQL, CSV, XML) and Pilots
Outline
 Recap: Challenges addressed by the WG
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
Making Dynamic Data Citeable
Data Citation: Data + Means-of-access
 Data  time-stamped & versioned (aka history)
Researcher creates working-set via some interface:
 Access  assign PID to QUERY, enhanced with
 Time-stamping for re-execution against versioned DB
 Re-writing for normalization, unique-sort, mapping to history
 Hashing result-set: verifying identity/correctness
leading to landing page
S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation.
In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013
http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf
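The model above can be sketched in a few lines of Python. This is a toy, in-memory illustration; the record layout, timestamps, and hash scheme are invented for the example, not taken from the reference implementation:

```python
import hashlib

# Versioned records: each carries valid_from/valid_to timestamps (None = still current).
records = [
    {"id": 1, "value": 10, "valid_from": 100, "valid_to": None},
    {"id": 2, "value": 20, "valid_from": 100, "valid_to": 200},   # superseded at t=200
    {"id": 2, "value": 25, "valid_from": 200, "valid_to": None},  # corrected version
]

def execute(query_ts):
    """Re-execute the 'query': select records as they existed at query_ts, stably sorted."""
    rows = [r for r in records
            if r["valid_from"] <= query_ts
            and (r["valid_to"] is None or r["valid_to"] > query_ts)]
    return sorted(rows, key=lambda r: r["id"])  # unique, stable sort

def result_hash(rows):
    """Fixity information computed over the sorted result set."""
    data = ";".join(f"{r['id']}:{r['value']}" for r in rows)
    return hashlib.sha256(data.encode()).hexdigest()

# Citing: store the query's timestamp and result hash under a PID.
query_store = {}
ts = 150
query_store["pid-0001"] = {"timestamp": ts, "hash": result_hash(execute(ts))}

# Later re-execution yields the identical subset, verifiable via the hash.
assert result_hash(execute(query_store["pid-0001"]["timestamp"])) \
    == query_store["pid-0001"]["hash"]
```

Note that only the query, timestamp, and hash are stored, not the data subset itself; the subset is reproduced on demand from the versioned store.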
Data Citation – Deployment
 Researcher uses workbench to identify subset of data
 Upon executing selection („download“) user gets
 Data (package, access API, …)
 PID (e.g. DOI) (Query is time-stamped and stored)
 Hash value computed over the data for local storage
 Recommended citation text (e.g. BibTeX)
 PID resolves to landing page
 Provides detailed metadata, link to parent data set, subset,…
 Option to retrieve original data OR current version OR changes
 Upon activating PID associated with a data citation
 Query is re-executed against time-stamped and versioned DB
 Results as above are returned
 Query store aggregates data usage
Note: the query string provides excellent
provenance information on the data set!
This is an important advantage over
traditional approaches relying on, e.g.,
storing a list of identifiers or a DB dump.
Identify which parts of the data are used;
if data changes, identify which queries
(studies) are affected.
Data Citation – Recommendations
Preparing Data & Query Store
- R1 – Data Versioning
- R2 – Timestamping
- R3 – Query Store
When Data should be persisted
- R4 – Query Uniqueness
- R5 – Stable Sorting
- R6 – Result Set Verification
- R7 – Query Timestamping
- R8 – Query PID
- R9 – Store Query
- R10 – Citation Text
When Resolving a PID
- R11 – Landing Page
- R12 – Machine Actionability
Upon Modifications to the
Data Infrastructure
- R13 – Technology Migration
- R14 – Migration Verification
Data Citation – Recommendations
A) Preparing the Data and the Query Store
 R1 – Data Versioning: Apply versioning to ensure that earlier
states of the data can be retrieved
 R2 – Timestamping: Ensure that operations on data are
timestamped, i.e. any additions, deletions are marked with a
timestamp
 R3 – Query Store: Provide means to store the queries and
metadata to re-execute them in the future
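A minimal sketch of R1/R2 using SQLite. Table and column names are hypothetical; this is the "integrated" layout in which the original table is extended with temporal metadata and a record-version column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrated versioning (R1/R2): the original table is extended with temporal
# metadata; the primary key is expanded by a record-version column.
conn.execute("""
    CREATE TABLE readings (
        sensor_id   INTEGER,
        version     INTEGER,
        value       REAL,
        valid_from  REAL NOT NULL,
        valid_to    REAL,             -- NULL = current version
        PRIMARY KEY (sensor_id, version)
    )
""")

def upsert(sensor_id, value, ts):
    """Insert a new version; close the validity interval of the previous one."""
    prev = conn.execute(
        "SELECT MAX(version) FROM readings WHERE sensor_id = ?",
        (sensor_id,)).fetchone()[0]
    if prev is not None:
        conn.execute(
            "UPDATE readings SET valid_to = ? WHERE sensor_id = ? AND version = ?",
            (ts, sensor_id, prev))
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?, NULL)",
        (sensor_id, (prev or 0) + 1, value, ts))

upsert(1, 10.5, ts=100.0)
upsert(1, 10.7, ts=200.0)   # correction: the old version is kept, its interval closed
```

A query restricted to `valid_from <= t AND (valid_to IS NULL OR valid_to > t)` then returns the data exactly as it existed at time `t`.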
Data Citation – Recommendations
B) Persistently Identify Specific Data sets (1/2)
When a data set should be persisted:
 R4 – Query Uniqueness: Re-write the query to a normalized form
so that identical queries can be detected. Compute a checksum of
the normalized query to efficiently detect identical queries
 R5 – Stable Sorting: Ensure an unambiguous sorting of the
records in the data set
 R6 – Result Set Verification: Compute fixity information/checksum
of the query result set to enable verification of the correctness of a
result upon re-execution
 R7 – Query Timestamping: Assign a timestamp to the query
based on the last update to the entire database (or the last update
to the selection of data affected by the query or the query execution
time). This allows retrieving the data as it existed at query time
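R4 through R7 can be illustrated as follows. The normalization here is deliberately naive (lowercasing and collapsing whitespace); real systems rewrite the parsed query, and the field names in the stored record are invented for the example:

```python
import hashlib
import re

def normalize(query):
    """R4: rewrite to a normalized form so identical queries can be detected
    (toy normalization: collapse whitespace, lowercase)."""
    return re.sub(r"\s+", " ", query.strip()).lower()

def checksum(text):
    return hashlib.sha256(text.encode()).hexdigest()

q1 = "SELECT value FROM readings   WHERE sensor_id = 1"
q2 = "select value from readings where sensor_id = 1"
# Identical queries are detected efficiently via the checksum of the normalized form.
assert checksum(normalize(q1)) == checksum(normalize(q2))

# R5/R6: fixity information over the unambiguously sorted result set.
result = [(2, 20.0), (1, 10.5)]
result_checksum = checksum(repr(sorted(result)))

# R7: timestamp the query, e.g. with the last update to the database.
query_record = {"query": normalize(q1),
                "query_hash": checksum(normalize(q1)),
                "result_hash": result_checksum,
                "timestamp": 200.0}
```

The stable sort in R5 matters: without it, re-execution could return the same rows in a different order and the fixity check of R6 would fail spuriously.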
Data Citation – Recommendations
B) Persistently Identify Specific Data sets (2/2)
When a data set should be persisted:
 R8 – Query PID: Assign a new PID to the query if either the
query is new or if the result set returned from an earlier identical
query is different due to changes in the data. Otherwise, return
the existing PID
 R9 – Store Query: Store query and metadata (e.g. PID, original
and normalized query, query & result set checksum, timestamp,
superset PID, data set description and other) in the query store
 R10 – Citation Text: Provide citation text including the PID in the
format prevalent in the designated community to lower barrier for
citing data.
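The R8 decision logic (a new PID only for a new query, or for an identical query whose result set has changed) might look like the following sketch; the PID format and store layout are invented:

```python
import itertools

query_store = {}                  # R9: PID -> stored query + metadata
_pid_counter = itertools.count(1)

def cite(normalized_query, result_hash, timestamp):
    """R8: assign a new PID only if the query is new, or if an identical query
    now yields a different result set; otherwise return the existing PID."""
    for pid, rec in query_store.items():
        if rec["query"] == normalized_query and rec["result_hash"] == result_hash:
            return pid                       # same query, same data: reuse the PID
    pid = f"pid-{next(_pid_counter):04d}"
    query_store[pid] = {"query": normalized_query,
                        "result_hash": result_hash,
                        "timestamp": timestamp}
    return pid

def citation_text(pid):
    """R10: citation snippet in the community's prevalent format (BibTeX here)."""
    return f"@misc{{{pid}, note = {{Data subset, PID {pid}}}}}"

p1 = cite("select * from t", "hash-a", 100.0)
p2 = cite("select * from t", "hash-a", 150.0)   # unchanged data: same PID
p3 = cite("select * from t", "hash-b", 200.0)   # data changed: new PID
```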
Data Citation – Recommendations
C) Resolving PIDs and Retrieving Data
 R11 – Landing Page: Make the PIDs resolve to a human
readable landing page that provides the data (via query re-
execution) and metadata, including a link to the superset
(PID of the data source) and citation text snippet
 R12 – Machine Actionability: Provide an API / machine
actionable landing page to access metadata and data via query
re-execution
Data Citation – Recommendations
D) Upon Modifications to the Data Infrastructure
 R13 – Technology Migration: When data is migrated to a new
representation (e.g. new database system, a new schema or a
completely different technology), migrate also the queries and
associated checksums
 R14 – Migration Verification: Verify successful data and query
migration, ensuring that queries can be re-executed correctly
Benefits
 Retrieval of precise subset with low storage overhead
 Subset as cited or as it is now (including e.g. corrections)
 Query provides provenance information
 Query store supports analysis of data usage
 Checksums support verification
 Same principles applicable across all settings
- Small and large data
- Static and dynamic data
- Different data representations (RDBMS, CSV, XML, LOD, …)
 Would work also for more sophisticated/general
transformations on data beyond select/project
Outline
 Recap: Challenges addressed by the WG
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
WG Pilots
Name | Data Type | Status | Notes
Timbus | RDBMS | research, finished | Sensor data, pilot
XML-Reference | XML | research, finished | eXist-DB
DEXHELPP | CSV/RDBMS | research, running | Social security data
CSV-Reference | CSV/RDBMS | reference, running (β) | Reference implem.
GIT-Reference | <ASCII> | reference, running (α) | Reference implem.
VAMDC | SQL/NoSQL/ASCII -> XML | deployment, running | Distributed data center
CBMI@wustl | RDBMS | deployment, starting | Integration into i2b2
CCCA | NetCDF | deployment, starting | Climate data
ENVRIplus | | deployment, starting | ICOS: Carbon Obs. Infr.
ARGO | NetCDF | deployment, starting | ODIP-II, RDA-Europe
BCO-DMO | CSV | deployment, starting | RDA-US
VMC (Vermont) | VMC data cat. | deployment, starting | Forest Research Data
<a few others> | CSV, RDBMS | deployment, planned | Conceptual evaluation, seeking funding
First Pilots for SQL Data
Stefan Pröll, SBA Research
sproell@sba-research.org
SQL Prototype Implementation
 LNEC Laboratory of Civil Engineering, Portugal
 Monitoring dams and bridges
 31 manual sensor instruments
 25 automatic sensor instruments
 Web portal
- Select sensor data
- Define timespans
 Report generation
- Analysis processes
- LaTeX
- publish PDF report
Florian Fuchs [CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
SQL Time-Stamping and Versioning
 Integrated
- Extend original tables by temporal metadata
- Expand primary key by record-version column
 Hybrid
- Utilize history table for deleted record versions with metadata
- Original table reflects latest version only
 Separated
- Utilizes full history table
- Also inserts reflected in history table
 Solution to be adopted depends on trade-off
- Storage Demand
- Query Complexity
- Software adaptation
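The hybrid approach can be sketched with SQLite triggers (schema and trigger names are illustrative): the live table keeps only the latest version, while updates and deletes archive the old version into a history table with metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hybrid versioning: the original table holds only the latest version;
    -- superseded or deleted versions move to a history table with metadata.
    CREATE TABLE readings (sensor_id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE readings_history (
        sensor_id   INTEGER,
        value       REAL,
        archived_at REAL DEFAULT (julianday('now')),
        operation   TEXT);

    CREATE TRIGGER readings_upd BEFORE UPDATE ON readings
    BEGIN
        INSERT INTO readings_history (sensor_id, value, operation)
        VALUES (OLD.sensor_id, OLD.value, 'update');
    END;

    CREATE TRIGGER readings_del BEFORE DELETE ON readings
    BEGIN
        INSERT INTO readings_history (sensor_id, value, operation)
        VALUES (OLD.sensor_id, OLD.value, 'delete');
    END;
""")
conn.execute("INSERT INTO readings VALUES (1, 10.5)")
conn.execute("UPDATE readings SET value = 10.7 WHERE sensor_id = 1")
conn.execute("DELETE FROM readings WHERE sensor_id = 1")
```

The trade-off named above is visible here: the live table stays small and fast, but any as-of query must reconstruct past states by combining the live and history tables.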
SQL: Storing Queries
 Add query store containing
- PID of the query
- Original query
- Re-written query + query string hash
- Timestamp
(as used in re-written query)
- Hash-key of query result
- Metadata useful for citation /
landing page
(creator, institution, rights, …)
- PID of parent dataset
(or using fragment identifiers for query)
SQL Query Re-Writing
 Adapt query to history table
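A toy illustration of the re-write, assuming a "separated" full-history table with hypothetical `valid_from`/`valid_to` columns. Real implementations rewrite the parsed query rather than doing string substitution:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings_history (
        sensor_id INTEGER, version INTEGER, value REAL,
        valid_from REAL, valid_to REAL);
    INSERT INTO readings_history VALUES (1, 1, 10.5, 100.0, 200.0);
    INSERT INTO readings_history VALUES (1, 2, 10.7, 200.0, NULL);
""")

def rewrite(user_query, ts):
    """Map a query on the live table onto the history table and slice it at ts
    (a toy string rewrite for illustration only)."""
    return (user_query.replace("FROM readings", "FROM readings_history")
            + f" AND valid_from <= {ts}"
            + f" AND (valid_to IS NULL OR valid_to > {ts})"
            + " ORDER BY sensor_id, version")   # R5: stable, unambiguous sort

q = "SELECT value FROM readings WHERE sensor_id = 1"
as_of_150 = conn.execute(rewrite(q, 150.0)).fetchall()   # the data as cited
as_of_250 = conn.execute(rewrite(q, 250.0)).fetchall()   # the data as it is now
```

The same stored query thus serves both options offered on the landing page: retrieving the original subset or the current version.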
Reference Implementation for
CSV Data
Stefan Pröll
sproell@sba-research.org
CSV Prototype: Basic Steps
 Upload interface for CSV files
 2 approaches:
• Migrate CSV file into RDBMS
 Generate table structure, identify primary key
 Add metadata columns for versioning, indices
• Use GIT for data and separate branch for queries
 Dynamic data
 Update / delete existing records
 Append new data
 Access interface
 Track subset creation
 Store queries
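The first approach, migrating a CSV file into an RDBMS with added versioning metadata, might be sketched as follows (table name, column handling, and the choice of the first column as key are simplifications for the example):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")

def ingest(csv_text, ts):
    """Migrate a CSV file into a table extended with versioning metadata:
    a record-version column and a validity interval per record."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f"{c} TEXT" for c in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS data ({cols}, "
                 "record_version INTEGER, valid_from REAL, valid_to REAL)")
    for row in data:
        conn.execute(
            f"INSERT INTO data VALUES ({','.join('?' * len(header))}, 1, ?, NULL)",
            (*row, ts))

ingest("id,temp\n1,10.5\n2,20.0\n", ts=100.0)
```

Updates and appends then follow the same pattern as for native SQL data: close the validity interval of the old record version and insert a new one.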
CSV Data Prototype
CSV Data Prototype
CSV Data Prototype
Progress on Data Citation within
VAMDC
C.M. Zwölf and VAMDC Consortium
carlo-maria.zwolf@obspm.fr
VAMDC: single and unique access to heterogeneous A+M databases,
serving plasma sciences, lighting technologies, atmospheric physics,
environmental sciences, fusion technologies, health and clinical
sciences, and astrophysics.
 Federates 28 heterogeneous
databases
http://portal.vamdc.org/
 Distributed infrastructure with no
central management system
 The “V” of VAMDC stands for Virtual:
the e-infrastructure does not itself
contain data; it is a wrapping layer
that exposes a set of heterogeneous
databases in a unified way
 Relies on a strong and sustainable
technical and political organisation
Virtual Atomic and Molecular Data Centre
VAMDC architecture: each existing, independent A+M database is
wrapped by a VAMDC layer, making it a VAMDC Node registered in the
VAMDC Registry. Queries are submitted using a standard vocabulary;
results are provided formatted into a standard XML file (XSAMS).
Clients (Portal, SpecView, SpectCol) ask the registry for available
resources, dispatch a unique A+M query to all registered nodes, and
receive a set of XSAMS files.
VAMDC Infrastructure
Architecture of the query store: a Central Log Service, plus a web
service that takes a query ID and returns the associated results.
VAMDC – proposed API for the query store:
• A web service that takes a date and a query, and returns a result
identical to the one that would be obtained by submitting the query
on the provided date
• A web service that takes a query ID, and returns the query and the
associated timestamp
• A web service that takes a query and a date, and returns the
associated query ID
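These three query-store services could be mocked as follows; the function names, ID scheme, and in-memory store are illustrative stand-ins, not the actual VAMDC API:

```python
import hashlib

_store = {}   # query_id -> (query, timestamp); stand-in for the Central Log Service

def query_id_for(query, date):
    """Takes a query and a date; returns the associated query ID."""
    qid = hashlib.sha256(f"{query}@{date}".encode()).hexdigest()[:12]
    _store.setdefault(qid, (query, date))
    return qid

def query_for(query_id):
    """Takes a query ID; returns the query and the associated timestamp."""
    return _store[query_id]

def execute_as_of(query, date):
    """Takes a date and a query; returns the result as it would have been
    obtained on that date (a placeholder re-execution here)."""
    return f"result of {query!r} at {date}"

qid = query_id_for("SELECT * FROM lines", "2016-05-01")
```

Resolving a citation then chains the services: look up the query and timestamp from the ID, and re-execute as of that timestamp.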
Versioning on Databases
WG Data Citation Pilot
CBMI @ WUSTL
Leslie McIntosh, Cynthia Hudson Vitale,
Snehil Gupta
Washington University in St. Louis
▪ Center for Biomedical Informatics,
Washington University in St. Louis
▪ Electronic medical health record aggregator i2b2
(Informatics for Integrating Biology and the Bedside)
NIH-funded, open-source software used by health-care systems
▪ Electronic patient medical records (EMR)
▪ i2b2 instance with de-identified data from local hospitals
and outpatient clinics
▪ Overall approx. 2 billion records
▪ 4 million patients, 48 million encounters, 82 million medications,
674 million lab results, 385 million vital-sign readings, …
▪ Obtained funding to implement WGDC recommendations
▪ Timeframe: 9 months
CBMI @ WUSTL
CBMI @ WUSTL
Outline
 Recap: Challenges addressed by the WG
 Recommendation of the RDA Working Group
 Pilots and Adoption
 Summary
Benefits
 Retrieval of precise subset with low storage overhead
 Subset as cited or as it is now (including e.g. corrections)
 Query provides provenance information
 Query store supports analysis of data usage
 Checksums support verification
 Same principles applicable across all settings
- Small and large data
- Static and dynamic data
- Different data representations (RDBMS, CSV, XML, LOD, …)
 Would work also for more sophisticated/general
transformations on data beyond select/project
Join RDA and Working Group WGDC
If you are interested in joining the discussion, contributing a
pilot, wish to establish a data citation solution, …
 Register for the RDA WG on Data Citation:
- Website:
https://rd-alliance.org/working-groups/data-citation-wg.html
- Mailinglist:
https://rd-alliance.org/node/141/archive-post-mailinglist
 Contact us if you plan to implement the recommendations
 Let us know your feedback, concerns, issues identified, …
Thank you!
https://rd-alliance.org/working-groups/data-citation-wg.html
[Architecture sketch: Data (Table A, Table B) → Query → Query Store;
Subsets; PID Provider → PID Store]
Data Citation – Recommendations
 2-page flyer,
more extensive doc to follow
 14 Recommendations
 Grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific
data sets
- Upon request of a PID
- Upon modifications to the data
infrastructure
 History
- First presented March 30 2015
- Major revision after workshop
April 20/21
- 4 workshops & presentations
- 2 webinars (June 9, June 24)
Data Citation for ENVRIplus
Ari Asmi
ari.asmi@helsinki.fi
ENVRI Plus – ICOS Data Citation
- Part of ENVRI PLUS data citation Workpackage
- ICOS – Integrated Carbon Observation System
(infrastructure)
Atmosphere, Ecosystems, Oceans:
• Distributed data production
• Distributed data storage
• Centralized “high level” data sets, updated daily
• Wide usage in carbon observation science
• Some NRT (near real time) data
• Some “high level” data storage
Versioning DB: OK
Distributed data delivery: potential issue
(if users bypass the web interface)
ENVRIplus ICOS Implementation (in progress)
Data Citation for ARGO
(ODIP II Project)
Helen Glaves
hmg@bgs.ac.uk
 Aims & objectives
- Resolve the ambiguity in the syntax for citation of dynamic
data
- Agree and ratify a common syntax for dynamic data citation
- Publish results in authoritative documentation e.g. DataCite
metadata schema
- Implement dynamic data citation for Argo data
Argo data use case
 Argo data held by several international data centres
- IFREMER
- NCEI (formerly the NOAA National Climatic Data Center,
the National Geophysical Data Center, and
the National Oceanographic Data Center)
- BODC
 Validation of method using a real world exemplar
 Results reported to RDA via DCWG and MDH IG
 Feed into related activities in ODIP, ENVRIplus, EUDAT etc.
Application scenario
Adoption of Data Citation Outcomes
by BCO-DMO, R2R
Cynthia Chandler, Adam Shepherd
 BCO-DMO
- Biological and Chemical Oceanography Data
Management Office (WHOI)
- Curation of marine ecosystem data
contributed by NSF-funded investigators
 R2R
- Rolling Deck to Repository
- Curation of routine, underway data from US
academic fleet, and authoritative expedition catalog
 Members of Marine Data Harmonization IG
US Ocean Science Domain Repositories
BCO-DMO Adoption of Data Citation Outputs
- Evaluation
– Evaluate recommendations
– Try implementation in existing
systems
- Trial
– BCO-DMO: R1-11 fit well with
current architecture; R12 doable;
test as part of DataONE node
membership
– R2R: curation of original field data
and selected subset of post-field
products (ship track); so no evolving
data
CARIACO zooplankton data subset, since 2000
/OCB/CARIACO/Zooplankton.html0?Date>20000101,
Cruise_ID,lon,lat,Date,zoop_DW_200,zoop_ash_200,
zoop_DW_500,zoop_ash_500
 Preserve the data subset
 Request a DOI
 Store data subset, query, and create new landing page
for data subset DOI
BCO-DMO - New capabilities

More Related Content

PDF
How can we ensure research data is re-usable? The role of Publishers in Resea...
PPTX
Authority files - Jisc Digital Festival 2014
PPTX
Research data management workshop april12 2016
PPTX
THOR Workshop - Data Publishing PLOS
PPTX
THOR Workshop - Introduction
PPTX
Why does research data matter to libraries
PPTX
Open Science (publishing) as-a-Service (Presentation by Paolo Manghi at the ...
PPTX
THOR Workshop - Data Publishing Elsevier
How can we ensure research data is re-usable? The role of Publishers in Resea...
Authority files - Jisc Digital Festival 2014
Research data management workshop april12 2016
THOR Workshop - Data Publishing PLOS
THOR Workshop - Introduction
Why does research data matter to libraries
Open Science (publishing) as-a-Service (Presentation by Paolo Manghi at the ...
THOR Workshop - Data Publishing Elsevier

What's hot (20)

PPTX
OpenAIRE: eInfrastructure for Open Science
PPTX
The Challenges of Making Data Travel, by Sabina Leonelli
PPTX
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
PPTX
THOR Workshop - Data Publishing
PDF
Open Science: Research Data Management
PPTX
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
PPTX
Data management: The new frontier for libraries
PPTX
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
PDF
Research Data Management and the brave new world, By Paul Ayris
PPTX
Burton - Security, Privacy and Trust
PDF
Data Publishing Models by Sünje Dallmeier-Tiessen
PPTX
Reproducibility (and the R*) of Science: motivations, challenges and trends
PDF
Dataverse in the Universe of Data by Christine L. Borgman
PPTX
The FAIRDOM Commons for Systems Biology
PPT
David Shotton - Research Integrity: Integrity of the published record
PDF
Levine - Data Curation; Ethics and Legal Considerations
PPTX
Stop press: should embargo conditions apply to metadata?
PPT
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
PPTX
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
PPTX
The Needs of stakeholders in the RDM process - the role of LEARN. By Paul Ayr...
OpenAIRE: eInfrastructure for Open Science
The Challenges of Making Data Travel, by Sabina Leonelli
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
THOR Workshop - Data Publishing
Open Science: Research Data Management
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
Data management: The new frontier for libraries
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
Research Data Management and the brave new world, By Paul Ayris
Burton - Security, Privacy and Trust
Data Publishing Models by Sünje Dallmeier-Tiessen
Reproducibility (and the R*) of Science: motivations, challenges and trends
Dataverse in the Universe of Data by Christine L. Borgman
The FAIRDOM Commons for Systems Biology
David Shotton - Research Integrity: Integrity of the published record
Levine - Data Curation; Ethics and Legal Considerations
Stop press: should embargo conditions apply to metadata?
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
The Needs of stakeholders in the RDM process - the role of LEARN. By Paul Ayr...
Ad

Similar to Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group, by Andreas Rauber (20)

PPT
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
PPT
Labmatrix
PDF
PPTX
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
PDF
Data Infrastructure for a World of Music
PPTX
Is the traditional data warehouse dead?
PPTX
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
PPTX
JOSA TechTalk: Metadata Management
in Big Data
PDF
Hoodie - DataEngConf 2017
PPT
Dw Concepts
PPTX
Odam: Open Data, Access and Mining
PDF
HEPData Open Repositories 2016 Talk
PPTX
Tableau and hadoop
PPT
LECTURE4.ppt
PPT
2004-11-13 Supersite Relational Database Project: (Data Portal?)
PPT
Srds Pres011120
PPTX
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
PPT
DataFinder concepts and example: General (20100503)
PPTX
Cloud computing major project
PDF
Hpdw 2015-v10-paper
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Labmatrix
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Data Infrastructure for a World of Music
Is the traditional data warehouse dead?
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
JOSA TechTalk: Metadata Management
in Big Data
Hoodie - DataEngConf 2017
Dw Concepts
Odam: Open Data, Access and Mining
HEPData Open Repositories 2016 Talk
Tableau and hadoop
LECTURE4.ppt
2004-11-13 Supersite Relational Database Project: (Data Portal?)
Srds Pres011120
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
DataFinder concepts and example: General (20100503)
Cloud computing major project
Hpdw 2015-v10-paper
Ad

More from LEARN Project (20)

PDF
Research Data Management, Challenges and Tools - Per Öster
PPTX
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
PDF
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
PDF
Data, Science, Society - Claudio Gutierrez, University of Chile
PPTX
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
PPTX
LEARN Final Conference: Tutorial Group | Costing RDM
PPT
Paolo Budroni at COAR Annual Meeting
PDF
LEARN Webinar
PDF
Developing a Framework for Research Data Management Protocols
PDF
The Needs of Stakeholders in the RDM Process - the role of LEARN
PDF
Opening Research Data in EU Universities: Policies, Motivators and Challenges
PDF
About Data From A Machine Learning Perspective
PDF
LEARN Carribean Workshop Opening Remarks
PDF
Managing Research Data in the Caribbean: Good practices and challenges
PDF
LEARN Project: The Story So Far
PDF
The Data Deluge: the Role of Research Organisations
PDF
Data for Development in the Caribbean
PDF
Open Data in a Big World by Fernando Ariel López
PDF
CENTRO DE DATOS
PDF
Research Data Management in São Paulo by Fabio Kon FAPESP
Research Data Management, Challenges and Tools - Per Öster
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
Data, Science, Society - Claudio Gutierrez, University of Chile
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
LEARN Final Conference: Tutorial Group | Costing RDM
Paolo Budroni at COAR Annual Meeting
LEARN Webinar
Developing a Framework for Research Data Management Protocols
The Needs of Stakeholders in the RDM Process - the role of LEARN
Opening Research Data in EU Universities: Policies, Motivators and Challenges
About Data From A Machine Learning Perspective
LEARN Carribean Workshop Opening Remarks
Managing Research Data in the Caribbean: Good practices and challenges
LEARN Project: The Story So Far
The Data Deluge: the Role of Research Organisations
Data for Development in the Caribbean
Open Data in a Big World by Fernando Ariel López
CENTRO DE DATOS
Research Data Management in São Paulo by Fabio Kon FAPESP

Recently uploaded (20)

PDF
Launch Your Data Science Career in Kochi – 2025
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Computer network topology notes for revision
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
Introduction to Business Data Analytics.
PDF
Lecture1 pattern recognition............
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
IB Computer Science - Internal Assessment.pptx
PPT
Quality review (1)_presentation of this 21
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Launch Your Data Science Career in Kochi – 2025
Supervised vs unsupervised machine learning algorithms
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
STUDY DESIGN details- Lt Col Maksud (21).pptx
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
IBA_Chapter_11_Slides_Final_Accessible.pptx
Computer network topology notes for revision
.pdf is not working space design for the following data for the following dat...
Business Ppt On Nestle.pptx huunnnhhgfvu
oil_refinery_comprehensive_20250804084928 (1).pptx
Introduction to Business Data Analytics.
Lecture1 pattern recognition............
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Clinical guidelines as a resource for EBP(1).pdf
IB Computer Science - Internal Assessment.pptx
Quality review (1)_presentation of this 21
Business Acumen Training GuidePresentation.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx

Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group, by Andreas Rauber

  • 1. Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group WGDC Andreas Rauber Technical University of Vienna rauber@ifs.tuwien.ac.at http://www.ifs.tuwien.ac.at/~andi
  • 2. Outline  Challenges in Data Identification and Citation  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 3. Data Citation
    Why do we want to precisely identify data and cite it?
     Citation to give and accumulate credit for the
    - Creator of the dataset
    - Datacenter hosting the data
    - Funder providing funding
    - …
     Citation to assist in research
    - Reproduce a study
    - Compare different models
    - Automate meta-studies
    - …
  • 4. Data Citation
     Citing data may seem easy
    - from providing a URL in a footnote
    - via providing a reference in the bibliography section
    - to assigning a PID (DOI, ARK, …) to a dataset in a repository
     What’s the problem?
    Page 4
  • 5. Granularity of Data Identification / Citation
     What about the granularity of data to be identified/cited?
    - Databases collect enormous amounts of data over time
    - Researchers use specific subsets of data
    - Need to identify precisely the subset used
     Current approaches
    - Citing entire dataset, providing textual description of subset -> imprecise (ambiguity)
    - Storing a copy of subset as used in study -> scalability
    - Storing list of record identifiers in subset -> scalability, not for arbitrary subsets (e.g. when not entire record selected)
     Would like to be able to identify & cite precisely the subset of (dynamic) data used in a study
    Page 5
  • 6. Citation of Dynamic Data
     Citable datasets have to be static
    - Fixed set of data, no changes: no corrections to errors, no new data being added
     But: (research) data is dynamic
    - Adding new data, correcting errors, enhancing data quality, …
    - Changes sometimes highly dynamic, at irregular intervals
     Current approaches
    - Identifying entire data stream, without any versioning
    - Using “accessed at” date
    - “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!)
     Would like to cite precisely the data as it existed at a certain point in time, without delaying release of new data
    Page 6
  • 7. Data Citation – Requirements
     Dynamic data
    - corrections, additions, …
     Arbitrary subsets of data (granularity)
    - rows/columns, time sequences, …
    - from single number to the entire set
     Stable across technology changes
    - e.g. migration to new database
     Machine-actionable
    - not just machine-readable, definitely not just human-readable and interpretable
     Scalable to very large / highly dynamic datasets
    - But: should also work for small and/or static datasets!
  • 8. RDA WG Data Citation
     Research Data Alliance
     WG on Data Citation: Making Dynamic Data Citeable
     March 2014 – Sep 2015
    - Concentrating on the problems of large, dynamic (changing) datasets
    - Focus: identification of data. Not: PID systems, metadata, citation string, attribution, …
    - Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …)
    - Continuing support for adoption
    https://rd-alliance.org/working-groups/data-citation-wg.html
  • 9. Data Citation – Output
     14 Recommendations grouped into 4 phases:
    - Preparing data and query store
    - Persistently identifying specific data sets
    - Resolving PIDs
    - Upon modifications to the data infrastructure
     2-page flyer
     Technical Report: draft at https://rd-alliance.org/system/files/documents/RDA-Guidelines_TCDL_draft.pdf
     Reference implementations (SQL, CSV, XML) and Pilots
  • 10. Outline  Recap: Challenges addressed by the WG  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 11. Making Dynamic Data Citeable
    Data Citation: Data + Means-of-access
     Data  time-stamped & versioned (aka history)
    Researcher creates working-set via some interface:
     Access  assign PID to QUERY, enhanced with
    - Time-stamping for re-execution against versioned DB
    - Re-writing for normalization, unique-sort, mapping to history
    - Hashing result-set: verifying identity/correctness
    leading to landing page
    S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013. http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf
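The access steps on this slide (time-stamping, query normalization, result-set hashing, PID assignment) can be sketched as follows. This is an illustrative Python sketch, not the WG's reference implementation; the normalization rule, field names, and use of a UUID as a PID stand-in are all hypothetical.

```python
import hashlib
import uuid
from datetime import datetime, timezone

def normalize(query: str) -> str:
    """Hypothetical query normalization: collapse whitespace, lowercase."""
    return " ".join(query.split()).lower()

def cite_query(query: str, result_rows):
    """Sketch of the access steps: normalize, timestamp, hash, assign PID."""
    normalized = normalize(query)
    query_hash = hashlib.sha256(normalized.encode()).hexdigest()
    # Unique, stable sort of the result set before hashing,
    # so re-execution yields the same fixity information
    sorted_rows = sorted(result_rows)
    result_hash = hashlib.sha256(repr(sorted_rows).encode()).hexdigest()
    return {
        "pid": str(uuid.uuid4()),          # stand-in for a DOI/ARK
        "query": query,
        "normalized_query": normalized,
        "query_hash": query_hash,
        "result_hash": result_hash,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = cite_query("SELECT  station, value FROM sensors  WHERE value > 10",
                    [("s2", 12), ("s1", 17)])
```

The returned record corresponds to what a query store would persist; re-executing the stored, time-stamped query and comparing hashes verifies the citation.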
  • 12.–15. Data Citation – Deployment
     Researcher uses workbench to identify subset of data
     Upon executing selection („download“) user gets
    - Data (package, access API, …)
    - PID (e.g. DOI) (query is time-stamped and stored)
    - Hash value computed over the data for local storage
    - Recommended citation text (e.g. BibTeX)
     PID resolves to landing page
    - Provides detailed metadata, link to parent data set, subset, …
    - Option to retrieve original data OR current version OR changes
     Upon activating PID associated with a data citation
    - Query is re-executed against time-stamped and versioned DB
    - Results as above are returned
     Query store aggregates data usage
    Note: the query string provides excellent provenance information on the data set! This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump. It identifies which parts of the data are used; if data changes, the affected queries (studies) can be identified.
  • 16. Data Citation – Recommendations
    Preparing Data & Query Store
    - R1 – Data Versioning
    - R2 – Timestamping
    - R3 – Query Store
    When Data Should Be Persisted
    - R4 – Query Uniqueness
    - R5 – Stable Sorting
    - R6 – Result Set Verification
    - R7 – Query Timestamping
    - R8 – Query PID
    - R9 – Store Query
    - R10 – Citation Text
    When Resolving a PID
    - R11 – Landing Page
    - R12 – Machine Actionability
    Upon Modifications to the Data Infrastructure
    - R13 – Technology Migration
    - R14 – Migration Verification
  • 17. Data Citation – Recommendations
    A) Preparing the Data and the Query Store
     R1 – Data Versioning: Apply versioning to ensure that earlier states of the data can be retrieved
     R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions and deletions are marked with a timestamp
     R3 – Query Store: Provide means to store the queries and the metadata needed to re-execute them in the future
  • 18. Data Citation – Recommendations
    B) Persistently Identify Specific Data Sets (1/2)
    When a data set should be persisted:
     R4 – Query Uniqueness: Re-write the query to a normalized form so that identical queries can be detected. Compute a checksum of the normalized query to efficiently detect identical queries
     R5 – Stable Sorting: Ensure an unambiguous sorting of the records in the data set
     R6 – Result Set Verification: Compute fixity information/checksum of the query result set to enable verification of the correctness of a result upon re-execution
     R7 – Query Timestamping: Assign a timestamp to the query based on the last update to the entire database (or the last update to the selection of data affected by the query, or the query execution time). This allows retrieving the data as it existed at query time
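R5 and R6 interact: fixity information is only reproducible if the result set is serialized in an unambiguous order. A minimal Python sketch of this property, assuming hypothetical row tuples and a simple CSV-style serialization:

```python
import hashlib

def result_checksum(rows):
    """R5/R6 sketch: sort rows unambiguously before computing fixity
    information, so logically identical result sets verify as equal."""
    canonical = "\n".join(",".join(map(str, r)) for r in sorted(rows))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = [("s1", 17), ("s2", 12)]
b = [("s2", 12), ("s1", 17)]   # same set, different retrieval order
assert result_checksum(a) == result_checksum(b)
```

Without the stable sort, the two checksums would differ and a correctly re-executed query would falsely fail verification.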
  • 19. Data Citation – Recommendations
    B) Persistently Identify Specific Data Sets (2/2)
    When a data set should be persisted:
     R8 – Query PID: Assign a new PID to the query if either the query is new or the result set returned from an earlier identical query is different due to changes in the data. Otherwise, return the existing PID
     R9 – Store Query: Store the query and metadata (e.g. PID, original and normalized query, query & result set checksums, timestamp, superset PID, data set description, and others) in the query store
     R10 – Citation Text: Provide a citation text including the PID in the format prevalent in the designated community to lower the barrier for citing data
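R8's decision logic (mint a new PID only for a new query or a changed result set) can be sketched as follows; the in-memory dictionary and the `pid-N` format are hypothetical stand-ins for a real query store and a DOI minting service.

```python
query_store = {}  # (query checksum, result-set checksum) -> PID

def assign_pid(query_hash: str, result_hash: str, mint_pid) -> str:
    """R8 sketch: return the existing PID for an identical query over
    unchanged data; otherwise mint and store a new one."""
    key = (query_hash, result_hash)
    if key not in query_store:
        query_store[key] = mint_pid()
    return query_store[key]

counter = iter(range(10**6))
mint = lambda: f"pid-{next(counter)}"

p1 = assign_pid("qh1", "rh1", mint)  # new query -> new PID
p2 = assign_pid("qh1", "rh1", mint)  # identical query, unchanged data -> same PID
p3 = assign_pid("qh1", "rh2", mint)  # same query, data changed -> new PID
```

Keying on both checksums is what makes a citation stable: the same subset keeps the same PID, while any change in the underlying data yields a distinguishable identifier.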
  • 20. Data Citation – Recommendations
    C) Resolving PIDs and Retrieving Data
     R11 – Landing Page: Make the PIDs resolve to a human-readable landing page that provides the data (via query re-execution) and metadata, including a link to the superset (PID of the data source) and a citation text snippet
     R12 – Machine Actionability: Provide an API / machine-actionable landing page to access metadata and data via query re-execution
  • 21. Data Citation – Recommendations
    D) Upon Modifications to the Data Infrastructure
     R13 – Technology Migration: When data is migrated to a new representation (e.g. a new database system, a new schema, or a completely different technology), migrate the queries and associated checksums as well
     R14 – Migration Verification: Verify successful data and query migration, ensuring that queries can be re-executed correctly
  • 23. Benefits
     Retrieval of precise subset with low storage overhead
     Subset as cited or as it is now (including e.g. corrections)
     Query provides provenance information
     Query store supports analysis of data usage
     Checksums support verification
     Same principles applicable across all settings
    - Small and large data
    - Static and dynamic data
    - Different data representations (RDBMS, CSV, XML, LOD, …)
     Would also work for more sophisticated/general transformations on data beyond select/project
  • 24. Outline  Recap: Challenges addressed by the WG  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 25. WG Pilots
    Name | Data Type | Status | Notes
    Timbus | RDBMS | research, finished | Sensor data, pilot
    XML-Reference | XML | research, finished | eXist-DB
    DEXHELPP | CSV/RDBMS | research, running | Social security data
    CSV-Reference | CSV/RDBMS | reference, running (β) | Reference implem.
    GIT-Reference | <ASCII> | reference, running (α) | Reference implem.
    VAMDC | SQL/NoSQL/ASCII -> XML | deployment, running | Distributed data center
    CBMI@wustl | RDBMS | deployment, starting | Integration into i2b2
    CCCA | NetCDF | deployment, starting | Climate data
    ENVRIplus | – | deployment, starting | ICOS: Carbon Obs. Infr.
    ARGO | NetCDF | deployment, starting | ODIP-II, RDA-Europe
    BCO-DMO | CSV | deployment, starting | RDA-US
    VMC (Vermont) | VMC data cat. | deployment, starting | Forest Research Data
    <a few others> | CSV, RDBMS | deployment, planned | Conceptual evaluation, seeking funding
  • 26. First Pilots for SQL Data Stefan Pröll, SBA Research sproell@sba-research.org
  • 27. SQL Prototype Implementation
     LNEC Laboratory of Civil Engineering, Portugal
     Monitoring dams and bridges
    - 31 manual sensor instruments
    - 25 automatic sensor instruments
     Web portal
    - Select sensor data
    - Define timespans
     Report generation
    - Analysis processes
    - LaTeX - publish PDF report
    Page 28
    Florian Fuchs [CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
  • 28. SQL Time-Stamping and Versioning
     Integrated
    - Extend original tables by temporal metadata
    - Expand primary key by record-version column
     Hybrid
    - Utilize history table for deleted record versions with metadata
    - Original table reflects latest version only
     Separated
    - Utilizes full history table
    - Also inserts reflected in history table
     Solution to be adopted depends on trade-off
    - Storage demand
    - Query complexity
    - Software adaptation
    Page 30
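As a sketch, the "integrated" approach might look as follows, shown here with SQLite: the original table is extended with temporal metadata and the primary key is expanded by a record-version column. All table and column names are hypothetical, and the error handling a production system would need is omitted.

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE sensor_data (
    record_id INTEGER,
    record_version INTEGER,
    value REAL,
    valid_from TEXT,
    valid_to TEXT,             -- NULL while this version is current
    PRIMARY KEY (record_id, record_version))""")

def upsert(record_id, value):
    """Insert a new record version, closing the previous one
    instead of overwriting it."""
    now = datetime.now(timezone.utc).isoformat()
    cur = db.execute(
        "SELECT MAX(record_version) FROM sensor_data WHERE record_id = ?",
        (record_id,))
    latest = cur.fetchone()[0]
    if latest is not None:
        db.execute("""UPDATE sensor_data SET valid_to = ?
                      WHERE record_id = ? AND record_version = ?""",
                   (now, record_id, latest))
    db.execute("INSERT INTO sensor_data VALUES (?, ?, ?, ?, NULL)",
               (record_id, (latest or 0) + 1, value, now))

upsert(1, 10.0)
upsert(1, 12.5)   # correction: version 1 is closed, version 2 is current
```

Because no version is ever deleted, any earlier state of the table can be reconstructed by filtering on `valid_from`/`valid_to`, which is exactly what the re-written citation queries rely on.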
  • 29. SQL: Storing Queries
     Add query store containing
    - PID of the query
    - Original query
    - Re-written query + query string hash
    - Timestamp (as used in re-written query)
    - Hash key of query result
    - Metadata useful for citation / landing page (creator, institution, rights, …)
    - PID of parent dataset (or using fragment identifiers for query)
    Page 31
  • 30. SQL Query Re-Writing
     Adapt query to the history table
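A minimal sketch of such a re-write, assuming the hypothetical `valid_from`/`valid_to` columns from the integrated versioning scheme: the stored query is extended with a temporal predicate that restricts it to the record versions valid at the citation timestamp. Real implementations would rewrite the parsed query rather than the raw string.

```python
def rewrite_for_history(query: str, timestamp: str) -> str:
    """Append the temporal predicate that restricts a query to the
    record versions valid at the given (citation) timestamp."""
    predicate = (f"valid_from <= '{timestamp}' "
                 f"AND (valid_to IS NULL OR valid_to > '{timestamp}')")
    if " where " in query.lower():
        return f"{query} AND {predicate}"
    return f"{query} WHERE {predicate}"

q = rewrite_for_history("SELECT value FROM sensor_data WHERE station = 's1'",
                        "2015-09-01T00:00:00Z")
```

Re-executing the re-written query at any later date returns the data exactly as it existed at the stored timestamp.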
  • 31. Reference Implementation for CSV Data Stefan Pröll sproell@sba-research.org
  • 32. CSV Prototype: Basic Steps
     Upload interface for CSV files
     2 approaches:
    • Migrate CSV file into RDBMS
    - Generate table structure, identify primary key
    - Add metadata columns for versioning, indices
    • Use GIT for data and separate branch for queries
     Dynamic data
    - Update / delete existing records
    - Append new data
     Access interface
    - Track subset creation
    - Store queries
    Barrymieny
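The first approach (migrating the CSV file into an RDBMS with added versioning metadata) can be sketched as follows; the table name, the TEXT-only column typing, and the `inserted_at`/`deleted_at` columns are simplifying assumptions, not the prototype's actual schema.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

csv_data = "station,value\ns1,17\ns2,12\n"   # stand-in for an uploaded file

reader = csv.reader(io.StringIO(csv_data))
header = next(reader)

# Generate the table structure from the CSV header and
# add metadata columns for versioning
db = sqlite3.connect(":memory:")
cols = ", ".join(f"{c} TEXT" for c in header)
db.execute(f"CREATE TABLE csv_table ({cols}, inserted_at TEXT, deleted_at TEXT)")

now = datetime.now(timezone.utc).isoformat()
placeholders = ", ".join("?" for _ in header)
for row in reader:
    db.execute(f"INSERT INTO csv_table VALUES ({placeholders}, ?, NULL)",
               (*row, now))
```

Updates and deletes then set `deleted_at` on the old row (and insert a replacement), so subsets can be re-created as of any timestamp, just as in the SQL pilot.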
  • 36. Progress on Data Citation within VAMDC C.M. Zwölf and VAMDC Consortium carlo-maria.zwolf@obspm.fr
  • 37. Virtual Atomic and Molecular Data Centre
    Plasma sciences, lighting technologies, atmospheric physics, environmental sciences, fusion technologies, health and clinical sciences, astrophysics
    VAMDC: single and unique access to heterogeneous A+M databases
     Federates 28 heterogeneous databases: http://portal.vamdc.org/
     Distributed infrastructure with no central management system
     The “V” of VAMDC stands for Virtual in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapping for exposing in a unified way a set of heterogeneous databases.
     Relies on a strong and sustainable technical and political organisation.
  • 38. VAMDC Infrastructure
    [Architecture diagram] A VAMDC wrapping layer turns each existing, independent A+M database into a VAMDC Node: queries are submitted in a standard vocabulary, and results are provided formatted into standard XML files (XSAMS). Each node is registered as a resource in the VAMDC Registry. Clients (Portal, SpecView, SpectCol) ask the registry for available resources, dispatch a unique A+M query to all registered nodes, and collect a set of XSAMS files.
  • 39. Architecture of the query store
    [Architecture diagram] VAMDC proposed API for the query store, backed by a Central Log Service and versioning on the databases:
    - Web service: takes a query ID, returns the associated results
    - Web service: takes a date and a query, returns a result identical to the one that would be obtained by submitting the query on the provided date
    - Web service: takes a query ID, returns the query and the associated timestamp
    - Web service: takes a query and a date, returns the associated query ID
  • 40. WG Data Citation Pilot CBMI @ WUSTL Leslie McIntosh, Cynthia Hudson Vitale, Snehil Gupta Washington University in St. Louis
  • 41. CBMI @ WUSTL
    ▪ Center for Biomedical Informatics, Washington University in St. Louis
    ▪ Electronic medical health record aggregator i2b2 (Informatics for Integrating Biology and the Bedside), NIH-funded, open-source software used across the health care system
    ▪ Electronic patient medical records (EMR)
    ▪ i2b2 instance with de-identified data from local hospitals and outpatient clinics
    ▪ Overall approx. 2 billion records
    ▪ 4 million patients, 48 million encounters, 82 million medications, 674 million lab results, 385 million vital sign data, …
    ▪ Obtained funding to implement WGDC recommendations
    ▪ Timeframe: 9 months
  • 43. WG Pilots
    Name | Data Type | Status | Notes
    Timbus | RDBMS | research, finished | Sensor data, pilot
    XML-Reference | XML | research, finished | eXist-DB
    DEXHELPP | CSV/RDBMS | research, running | Social security data
    CSV-Reference | CSV/RDBMS | reference, running (β) | Reference implem.
    GIT-Reference | <ASCII> | reference, running (α) | Reference implem.
    VAMDC | SQL/NoSQL/ASCII -> XML | deployment, running | Distributed data center
    CBMI@wustl | RDBMS | deployment, starting | Integration into i2b2
    CCCA | NetCDF | deployment, starting | Climate data
    ENVRIplus | – | deployment, starting | ICOS: Carbon Obs. Infr.
    ARGO | NetCDF | deployment, starting | ODIP-II, RDA-Europe
    BCO-DMO | CSV | deployment, starting | RDA-US
    VMC (Vermont) | VMC data cat. | deployment, starting | Forest Research Data
    <a few others> | CSV, RDBMS | deployment, planned | Conceptual evaluation, seeking funding
  • 44. Outline  Recap: Challenges addressed by the WG  Recommendation of the RDA Working Group  Pilots and Adoption  Summary
  • 45. Benefits
     Retrieval of precise subset with low storage overhead
     Subset as cited or as it is now (including e.g. corrections)
     Query provides provenance information
     Query store supports analysis of data usage
     Checksums support verification
     Same principles applicable across all settings
    - Small and large data
    - Static and dynamic data
    - Different data representations (RDBMS, CSV, XML, LOD, …)
     Would also work for more sophisticated/general transformations on data beyond select/project
  • 46. Join RDA and Working Group WGDC If you are interested in joining the discussion, contributing a pilot, wish to establish a data citation solution, …  Register for the RDA WG on Data Citation: - Website: https://rd-alliance.org/working-groups/data-citation-wg.html - Mailinglist: https://rd-alliance.org/node/141/archive-post-mailinglist  Contact us if you plan to implement the recommendations  Let us know your feedback, concerns, issues identified, …
  • 49. Data Citation – Recommendations
     2-page flyer, more extensive doc to follow
     14 Recommendations
     Grouped into 4 phases:
    - Preparing data and query store
    - Persistently identifying specific data sets
    - Upon request of a PID
    - Upon modifications to the data infrastructure
     History
    - First presented March 30, 2015
    - Major revision after workshop April 20/21
    - 4 workshops & presentations
    - 2 webinars (June 9, June 24)
  • 50. Data Citation for ENVRIplus Ari Asmi ari.asmi@helsinki.fi
  • 51. ENVRIplus – ICOS Data Citation
    - Part of the ENVRIplus data citation work package
    - ICOS – Integrated Carbon Observation System (infrastructure): Atmosphere, Ecosystems, Oceans
    • Distributed data production
    • Distributed data storage
    • Centralized “high level” data sets
    • Updated daily
    • Wide usage in carbon observation science
    • Some NRT (near real-time)
    • Some “high level” data storage
  • 52. ENVRIplus ICOS Implementation (in progress)
    - Versioning DB – OK!
    - Distributed data delivery – potential issue (if users bypass the web interface)
  • 53. Data Citation for ARGO (ODIP II Project) Helen Glaves hmg@bgs.ac.uk
  • 54. Argo data use case
     Aims & objectives
    - Resolve the ambiguity in the syntax for citation of dynamic data
    - Agree and ratify a common syntax for dynamic data citation
    - Publish results in authoritative documentation, e.g. the DataCite metadata schema
    - Implement dynamic data citation for Argo data
  • 55. Application scenario
     Argo data held by several international data centres
    - IFREMER
    - NCEI (formerly the NOAA National Climatic Data Center, the National Geophysical Data Center, and the National Oceanographic Data Center)
    - BODC
     Validation of method using a real-world exemplar
     Results reported to RDA via DCWG and MDH IG
     Feed into related activities in ODIP, ENVRIplus, EUDAT, etc.
  • 56. Adoption of Data Citation Outcomes by BCO-DMO, R2R Cynthia Chandler, Adam Shepherd
  • 57. US Ocean Science Domain Repositories
     BCO-DMO
    - Biological and Chemical Oceanography Data Management Office (WHOI)
    - Curation of marine ecosystem data contributed by NSF-funded investigators
     R2R – Rolling Deck to Repository
    - Curation of routine, underway data from the US academic fleet, and authoritative expedition catalog
     Members of the Marine Data Harmonization IG
  • 58. BCO-DMO Adoption of Data Citation Outputs
    - Evaluation
    – Evaluate recommendations
    – Try implementation in existing systems
    - Trial
    – BCO-DMO: R1–R11 fit well with current architecture; R12 doable; test as part of DataONE node membership
    – R2R: curation of original field data and selected subset of post-field products (ship track); so no evolving data
  • 59. CARIACO zooplankton data subset, since 2000 /OCB/CARIACO/Zooplankton.html0?Date>20000101, Cruise_ID,lon,lat,Date,zoop_DW_200,zoop_ash_200, zoop_DW_500,zoop_ash_500
  • 60.  Preserve the data subset  Request a DOI  Store data subset, query, and create new landing page for data subset DOI BCO-DMO - New capabilities