Data Citation Implementation Guidelines By Tim Clark

Joint Declaration of Data Citation
Principles
© 2015 Massachusetts General Hospital
and FORCE11.org
Tim Clark, Ph.D.
Assistant Professor of Neurology
Massachusetts General Hospital & Harvard Medical School
June 9, 2015

Non-reproducibility
Only 11% of landmark preclinical findings (6 of 53) could be reproduced
Begley CG and Ellis LM, Nature 2012, 483(7391):531-533

Transparency and
Reproducibility
• Transparency is the basis of reproducibility
• What we are aiming for is robust science
• Validation from multiple orthogonal viewpoints
• Focus on transparent communication of results

Joint Declaration of Data Citation Principles
endorsed by over 90 scholarly organizations

The Brief JDDCP
1. Importance. Data are first-
class objects.
2. Credit. Support citing all
contributors to the data.
3. Evidence. Assertions must
be traceable to evidence.
4. Unique ID. Cited datasets
must have resolvable IDs.
5. Access. Data must be
robustly archived.
6. Persistence. Metadata must
persist even after data is gone.
7. Specificity & Verifiability.
Get same dynamic time-slice.
8. Interoperable & flexible.
Give cross-community support.

[Figure: JDDCP diagram. Labels: document model; identification; archival, ID and retrieval; metadata; workflows; common APIs.]
[Figure: communities and domains shown: repositories, social science, biomedicine, earth science, climatology, scholarly publishing, web standards, scientific data standards, astronomy, physics, academic libraries, data science, software technology.]

Human and machine accessibility of
cited data in scholarly publications

or, how to store and access cited
data to radically improve scholarly
transparency - and so that BOTH
humans and machines are happy.

PeerJ Computer Science 1:e1. https://dx.doi.org/10.7717/peerj-cs.1

Basic guidelines
1. Cite data as you would cite publications.
2. Deposit data in an archival-quality repository.
3. Use an identifier scheme meeting JDDCP
criteria.
4. Identifiers should resolve to a landing page,
not directly to the data.
5. Landing pages describe the data in both
human and machine readable form.

Basic guidelines (contd.)
6. Landing page & data retention may differ.
7. Repositories should provide specific
guarantee of landing page persistence.
8. Landing pages should provide both human
and machine interpretable information.
9. Provide web service accessibility.
10. Stakeholder responsibilities for ecosystem.

1. Cite data as you would
cite publications
• Strongly preferred:
• Use the NISO JATS revision 1.1d2 XML schema
• Interim (less good) alternative:
• Use own XML schema, but do what JATS does.
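
A minimal sketch of what a dataset citation tagged in the JATS style might look like, built here with Python's standard XML library. The element and attribute names (element-citation, publication-type="data", data-title, pub-id) follow our reading of the JATS 1.1d2 data citation tagging and should be checked against the schema itself; the author, title, repository and DOI are made-up placeholders.

```python
import xml.etree.ElementTree as ET


def build_data_citation(surname, given_names, year, data_title, source, doi):
    """Build a JATS-style <element-citation> for a dataset (illustrative only)."""
    cite = ET.Element("element-citation", {"publication-type": "data"})
    name = ET.SubElement(cite, "name")
    ET.SubElement(name, "surname").text = surname
    ET.SubElement(name, "given-names").text = given_names
    ET.SubElement(cite, "year").text = year
    ET.SubElement(cite, "data-title").text = data_title
    ET.SubElement(cite, "source").text = source  # the repository holding the data
    pub_id = ET.SubElement(cite, "pub-id", {"pub-id-type": "doi"})
    pub_id.text = doi
    return cite


# All values below are hypothetical placeholders, not a real dataset.
citation = build_data_citation(
    "Doe", "J", "2015", "Example dataset", "Example Repository", "10.1234/example"
)
xml_text = ET.tostring(citation, encoding="unicode")
print(xml_text)
```

The point is that the dataset sits in the reference list with the same structure as an article citation: creator, year, title, source and a resolvable identifier.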

2. Deposit data in archival
quality repositories
Examples:
• NIH and EBI bioscience repositories;
• Standard earth/space/physical science repositories;
• Dataverse, Dryad, Figshare, Zenodo; etc.
Unacceptable:
• “Available on my laboratory website”.

3. Use an ID scheme that meets
JDDCP criteria (4-6)
Any currently available identifier scheme that is:
• Machine actionable,
• Globally unique,
• Widely used by a community, and
• Backed by a long-term commitment to persistence
Best practice:
• use a scheme that is cross-discipline, such as
DOI.
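
As a rough illustration of "machine actionable", the sketch below turns a DOI string into a resolvable HTTPS URL. The function name doi_to_url and the syntax check are assumptions made for this example; the pattern is only a heuristic, not the formal DOI grammar.

```python
import re
from urllib.parse import quote

# Heuristic shape check: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def doi_to_url(doi, resolver="https://doi.org/"):
    """Normalize a DOI written in common forms into a resolver URL."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError("not a recognizable DOI: " + repr(doi))
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return resolver + quote(doi, safe="/")


print(doi_to_url("doi:10.7717/peerj-cs.1"))
```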

Machine accessibility
Machine accessibility in this context means:
“access by well-documented Web services—preferably
RESTful Web services—to data and metadata stored in
a robust repository, independently of integrated browser
access by humans.”

Commitment to persistence
If a resolving authority is required, that authority has
demonstrated a reasonable chance to be present and
functional in the future;
Owner of the domain or the resolving authority has
made a credible commitment to ensure that its
identifiers will always resolve.
A useful survey of persistent identifier schemes
appears in Hilse & Kothe (2006).

• Digital Object Identifiers (DOIs)

4. Identifiers should resolve to a
landing page, not directly to data
Because:
• Data may be de-accessioned, like books, but
the description of the thing cited should remain;
• Data may be restricted (e.g. Protected Health
Information; specially-licensed data; etc.);
• Data may be VERY large and the user needs to
be able to decide whether to download or not.
• Content negotiation for machine access!
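
Content negotiation means a machine asks for the same identifier a human would follow, but states the representation it wants in the HTTP Accept header. The sketch below only builds such a request and does not send it. The helper name metadata_request is hypothetical, and whether a given resolver or repository actually serves the requested media type is an assumption to verify for each service.

```python
from urllib.request import Request

# A JSON citation-metadata media type used in DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi, media_type=CSL_JSON):
    """Build (but do not send) a request asking for machine-readable metadata."""
    return Request("https://doi.org/" + doi, headers={"Accept": media_type})


req = metadata_request("10.7717/peerj-cs.1")
print(req.full_url)
print(req.get_header("Accept"))
# To perform the request, pass req to urllib.request.urlopen(req).
```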

5. Landing pages describe the data
Best practices:
• Identifier, title, description, creator,
publisher/contact, publication/release date,
version.
Additional:
• Creator identifier (e.g. ORCID), license
Content encoding:
• HTML; plus…
• At least one non-proprietary machine-readable
format, e.g. XML, JSON/JSON-LD, RDF,
microformats, microdata, RDFa,…
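
One possible machine-readable encoding of the fields listed above is JSON-LD using the schema.org Dataset type. This is a sketch under that assumption; the guidelines do not mandate a vocabulary. The function name and every value (DOI, ORCID, names, dates) are placeholders.

```python
import json


def landing_page_jsonld(identifier, title, description, creator_name,
                        creator_orcid, publisher, release_date, version,
                        license_url):
    """Map the recommended landing page fields onto schema.org properties."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": identifier,
        "name": title,
        "description": description,
        "creator": {
            "@type": "Person",
            "@id": "https://orcid.org/" + creator_orcid,
            "name": creator_name,
        },
        "publisher": {"@type": "Organization", "name": publisher},
        "datePublished": release_date,
        "version": version,
        "license": license_url,
    }


# Placeholder values only; none of these identify a real dataset or person.
page = landing_page_jsonld(
    "https://doi.org/10.1234/example", "Example dataset",
    "A short description of the data.", "J. Doe", "0000-0000-0000-0000",
    "Example Repository", "2015-06-09", "1.0",
    "https://creativecommons.org/publicdomain/zero/1.0/",
)
print(json.dumps(page, indent=2))
```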

Serving the landing pages
“To enable automated agents to extract the metadata
these landing pages should include an HTML <link>
element specifying a machine readable form of the
page as an alternative.”
“For those that are capable of doing so, we
recommend also using Web Linking (Nottingham,
2010) to provide this information from all of the
alternative formats.”
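
A minimal sketch of how an automated agent might discover the machine-readable alternative from the HTML link element. The sample page, its paths and the media type are invented for illustration.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.alternates.append((attr.get("type"), attr["href"]))


def find_alternates(html_text):
    parser = AlternateLinkParser()
    parser.feed(html_text)
    return parser.alternates


# Hypothetical landing page head; the paths are placeholders.
SAMPLE_PAGE = """<html><head>
<title>Example dataset</title>
<link rel="alternate" type="application/ld+json" href="/dataset/123.jsonld">
<link rel="stylesheet" href="/site.css">
</head><body><h1>Example dataset</h1></body></html>"""

print(find_alternates(SAMPLE_PAGE))
```

The stylesheet link is ignored because its rel value is not "alternate"; only the JSON-LD alternative is reported.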

6. Landing page retention may differ
from data retention
Because:
• Repositories cannot commit to keeping
arbitrary and possibly very large volumes of
data forever!
• But when data is de-accessioned, the citation
identifier must not give a 404 error.
• Retain awareness of what was cited even if it
is not currently extant in a particular repository.
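
The retention rule can be sketched as a resolver that keeps serving a landing page after the data are gone. Returning HTTP 200 with an availability flag is an assumption of this example; the guideline only requires that the citation identifier does not produce a 404. Record, resolve and the catalog contents are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    identifier: str
    title: str
    data_url: Optional[str]  # None once the data have been de-accessioned


def resolve(identifier, catalog):
    """Return (status, landing_page) for a cited identifier."""
    record = catalog.get(identifier)
    if record is None:
        # Only identifiers that were never registered may be "not found".
        return 404, {"error": "identifier was never registered"}
    page = {
        "identifier": record.identifier,
        "title": record.title,
        "available": record.data_url is not None,
    }
    if record.data_url is not None:
        page["data_url"] = record.data_url
    else:
        page["note"] = "Data have been de-accessioned; metadata retained."
    return 200, page


# Placeholder catalog: one live dataset, one de-accessioned dataset.
catalog = {
    "doi:10.1234/live": Record("doi:10.1234/live", "Live data",
                               "https://repo.example/files/live.csv"),
    "doi:10.1234/gone": Record("doi:10.1234/gone", "Withdrawn data", None),
}
print(resolve("doi:10.1234/gone", catalog))
```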

7. Repositories should provide a
specific guarantee of persistence for
landing pages
Model guarantee language:
“[Organization/Institution Name] is committed to maintaining
persistent identifiers in [Repository Name] so that they will
continue to resolve to a landing page providing metadata
describing the data, including elements of stewardship,
provenance, and availability.
[Organization/Institution Name] has made the following plan
for organizational persistence and succession: [plan].”

8. Landing pages should provide
both human and machine
interpretable information.
Because:
• Mash-ups and distributed search.
• Apps that you haven’t yet thought of.
• Web services.
Examples of machine interpretable info:
• RDF, RDFa, XML, microformats, JSON-LD,
etc.

9. Provide web service accessibility
Because:
• Service composition, new apps, etc.
Best practice:
• RESTful web service, because this is a data-oriented
application and required functionality.
Much less good practice:
• SOAP, because SOAP is process-oriented.

10. Stakeholder
responsibilities
• Archives and repositories: IDs, resolution, landing
page metadata, dataset description, and data access
methods conform to these recommendations.
• Registries of repositories: Document conformance.
• Researchers: Treat data as first-class objects.
• Funders, scholarly societies, academic institutions:
Strongly encourage conformance to best practices.

Summary
• Use NISO JATS 1.1d2 to publish & archive documents.
• Cite datasets as if they were publications and deposit
datasets in archival repositories.
• Follow human & machine accessibility guidelines as
presented above in points 3 through 9.
• Adhere to stakeholder responsibilities as in point 10.
• Welcome to the future of scholarly publishing!

Acknowledgements
• Joan Starr, California Digital Library
• other co-authors of the “Achieving Human and Machine
Accessibility” publication
• FORCE11 Data Citation Implementation Group
• Maryann Martone, UCSD & FORCE11
• John Kunze, California Digital Library
• Harry Hochheiser, University of Pittsburgh
• Phil Bourne, NIH Data Science Directorate
