The Rise of Data Publishing
in the Digital World
(and how Dataverse and DataTags help)
Mercè Crosas, Ph.D.
Chief Data Science and Technology Officer
Institute for Quantitative Social Science
Harvard University
@mercecrosas
NDSR 2016 Symposium
From 1665 to late 20th century:
A steady increase in size and
complexity of research output
The number of journals doubles every 20 years
since the 1750s, alongside growth in the number of scientists
1700: 3 journals
1800: ~10 journals
1900: ~400 journals
2000: ~14,000 journals (peer-reviewed)
[Figure: number of journals (axis ticks 100 and 10,000) by year, 1665 to 1965; Mabe, 2003]
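As a rough consistency check on the doubling claim, the sketch below projects a journal count forward under a constant 20-year doubling time and compares it with the approximate counts quoted on the slide. The function name `projected_journals` is hypothetical, and the constant doubling time is a simplifying assumption, not a model taken from Mabe (2003).

```python
def projected_journals(start_count: float, start_year: int, end_year: int,
                       doubling_years: float = 20.0) -> float:
    """Project a count forward assuming it doubles every `doubling_years`."""
    elapsed = end_year - start_year
    return start_count * 2 ** (elapsed / doubling_years)


# Slide figures: ~10 journals in 1800, ~400 in 1900, ~14,000 in 2000.
print(round(projected_journals(10, 1800, 1900)))   # 320, same order as ~400
print(round(projected_journals(400, 1900, 2000)))  # 12800, same order as ~14,000
```

Five doublings per century multiply the count by 32, which is roughly the jump between the quoted figures for 1800, 1900, and 2000.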
Data Tables and Visuals Become Increasingly
Common, and Part of the Scientific Argument
[Timeline figure, 1665 to 1965, with annotations:]
• a few tables & visuals, as part of the text
• 50% cite previous work
• First line graphs and bar charts (Playfair, 1786)
• 50% of articles have tables & figures
• method sections appear
• First scatterplots (Herschel, 1833; Galton, 1896)
• most articles have tables & figures, often standalone
• 100% with citations (1 per 100 words), part of scholarly credit
Scholarly Publishing Adapts to the
Increase of Cognitive Complexity (Gross et al., 2001)
• 18th century:
• formal components appear in articles (introduction,
conclusions, tables, figures, citations)
• 19th century:
• articles explain data instead of establishing observations of facts
• wide use of visuals, high citation density, methods section
• 20th century:
• structured quantitative data with increased use of statistics
• wide range of data types with new technologies
• Number of scientists increases from 100s to a few million
• Science becomes extremely specialized:
• from 1 journal to 14,000 peer-reviewed journals
• one new journal for each 150 authors, read by 500
In the last decades, more
and more publications
and data
A Steeper Growth of Scholarly Output
Since 1950, the total number of journals doubles every ~15 years
2010: 80,000 journals
2010: 33,000 peer-reviewed
An Outburst of Research Data and Specialization
Results in > 1000 Community Repositories
• 1920 - 1950s: First Social Science Data Archives (ODUM, ICPSR, ...)
• 1970 - 1980s: First Biomedical Databases (PDB, GenBank, ...)
• 2016: A wide range of Research Data Repositories;
1500 repositories listed in re3data.org
Data Publishing Emerges as the Union of
Scholarly Publishing and Data Archiving
Scholarly publishing:
Distribute research output
• Attribution and credit
• Dissemination
• Finding & Reuse
Data Archiving:
Long-term access to data
• Accessibility
• Preservation
• Finding & Reuse
Why Data Publishing now?
• Data (and software) have become common input and
output of research
• A scholarly article cannot hold or describe accurately these
vast amounts of data and software
• As input and output of research, data must be citable and
accessible to enable validation and reuse, with attribution
Extending Gross et al.'s thesis, data publishing accommodates the
complexity of research input and output in the digital world.
What is needed for FAIR Data Publishing
Data Citation
• Persistent id to
reference data uniquely
• Support for versions
and fixity
• Attribution to authors
and repository
Metadata
• Catalog to discover and
locate the data
• Sufficient information to
understand and reuse the
data
Repository
• Digital access to metadata
and data
• Archive and preservation for
long-term access
• Interoperability through
standards and APIs
FAIR = Findable Accessible Interoperable Reusable
Dataverse: a data repository system that serves as a
solution for publishing FAIR research data
Around the World
Harvard Dataverse:
Generic data repository open
to researchers worldwide
Dataverse repositories serve a community, an institution, an archive, ...
Dataverses contain datasets,
datasets contain metadata and data files
Data Citation in Dataverse
Components labeled in the example citation:
• Authors
• Published Year
• Dataset Title
• Global Persistent Identifier
• Repository (= Data Publisher)
• Version (or time range)
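The citation components listed on this slide can be sketched as a small data structure. This is only an illustration: the class name `DatasetCitation`, the field names, the ordering and punctuation of the formatted string, and the placeholder identifier `doi:10.1234/EXAMPLE` are all assumptions, not the exact citation format Dataverse generates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetCitation:
    """Hypothetical container for the citation components on the slide."""
    authors: List[str]
    year: int
    title: str
    persistent_id: str  # global persistent identifier, e.g. a DOI or Handle
    repository: str     # the repository acts as the data publisher
    version: str

    def format(self) -> str:
        # The ordering and separators here are an assumption for illustration.
        author_part = "; ".join(self.authors)
        return (f'{author_part}, {self.year}, "{self.title}", '
                f'{self.persistent_id}, {self.repository}, {self.version}')


example = DatasetCitation(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2016,
    title="Example Survey Data",
    persistent_id="doi:10.1234/EXAMPLE",  # made-up placeholder identifier
    repository="Example Dataverse",
    version="V1",
)
print(example.format())
```

Keeping the version and the persistent identifier as separate fields lets a citation point at one specific published state of the dataset.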
Data Citation Basics
Force11, Joint Declaration of Data Citation Principles; Starr et al, 2015
The dataset landing page is accessible and guaranteed by the repository
(or data publisher), even when data are restricted or deaccessioned
Metadata In Dataverse
Metadata Level | Fields | Standards
Citation Metadata | author, title, repository, year published, version, etc. | Dublin Core; DataCite
Domain-specific Metadata | data collection info (methods, organism, observation, survey, experiment, etc.) | DDI (social sciences); ISA-Tab, BioCaddie (biomed); Virtual Observatory (astro); + custom metadata blocks
File-level Metadata | metadata inside the data file (variables, instrument details, geospatial info, etc.) | DDI (for variables); + more to be determined
Dataverse JSON Schema
Information Extraction: Tabular Files
File formats: RData, Stata, SPSS, Excel, CSV
        var 1   var 2   var 3
obs 1     2       a       0
obs 2     4       c       0
obs 3     6       b       1
obs 4     1       e       0
obs 5     2       a       1
obs 6     3       b       1
Variable Metadata: variable name, label, type, stats, geospatial coordinates
Data Values: independent of format
Universal Numerical Fingerprint (UNF): checksum on data values, from canonical format
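The idea of a checksum computed on data values in a canonical form, independent of the file format, can be sketched as follows. This is a simplified illustration and NOT the actual UNF algorithm: the canonical form used here (fixed-precision exponential notation for numbers), the newline separator, the truncation to 16 bytes, and the names `canonical` and `column_fingerprint` are all assumptions; the UNF specification defines its own normalization rules.

```python
import base64
import hashlib
from typing import Sequence, Union

Value = Union[int, float, str]


def canonical(value: Value) -> str:
    """Map a value to a format-independent text form (illustrative rule only)."""
    if isinstance(value, (int, float)):
        # 2 and 2.0 both become '2.000000e+00'
        return format(float(value), ".6e")
    return str(value)


def column_fingerprint(values: Sequence[Value]) -> str:
    """Hash the canonical forms of a column's values (not the real UNF)."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(canonical(value).encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent values cannot run together
    return base64.b64encode(digest.digest()[:16]).decode("ascii")


# The same column read as integers (e.g. from CSV) or floats (e.g. from Stata)
# yields the same fingerprint, because hashing happens after canonicalization.
as_ints = [2, 4, 6, 1, 2, 3]
as_floats = [2.0, 4.0, 6.0, 1.0, 2.0, 3.0]
print(column_fingerprint(as_ints) == column_fingerprint(as_floats))  # True
```

Because the hash depends only on the values, it can serve as a fixity check that survives conversion between file formats.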
Information Extraction: FITS (astro) Files
Header Metadata: coordinates (R.A., declination), photometric info, ...
Data Objects:
• Image files
• Spectra
• Data cubes
• Tables
• ...
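Header metadata extraction from a FITS file can be sketched with a minimal parser over the 80-character header cards that the FITS format uses. This is an illustration under stated assumptions, not how Dataverse implements it: the function names are hypothetical, the `RA` and `DEC` keywords and their values are made up for the sample (real files use varying keywords), and the parser ignores cases such as string values containing a slash. In practice an established library such as astropy would be the safer choice.

```python
from typing import Dict

CARD_LENGTH = 80  # each FITS header card is 80 ASCII characters


def parse_fits_header(raw: bytes) -> Dict[str, str]:
    """Collect keyword/value pairs from header cards up to the END card."""
    header: Dict[str, str] = {}
    text = raw.decode("ascii")
    for start in range(0, len(text), CARD_LENGTH):
        card = text[start:start + CARD_LENGTH]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, or blank cards carry no value
        # Naive split: drops the comment after '/', breaks on quoted slashes.
        value = card[10:].split("/", 1)[0].strip()
        header[keyword] = value.strip("'").strip()
    return header


def make_card(keyword: str, value: str) -> str:
    """Build one padded header card for the synthetic sample below."""
    return f"{keyword:<8}= {value}".ljust(CARD_LENGTH)


sample = "".join([
    make_card("SIMPLE", "T"),
    make_card("RA", "83.8221 / [deg] right ascension (made-up value)"),
    make_card("DEC", "-5.3911 / [deg] declination (made-up value)"),
    "END".ljust(CARD_LENGTH),
]).encode("ascii")

metadata = parse_fits_header(sample)
print(metadata["RA"], metadata["DEC"])
```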
In addition to data citation and
metadata features, Dataverse
has a rich set of features that
facilitate data publishing
Tiered Access
Tier | Metadata | Files | How to Access
Open (default): CC0 | Open | Open | Click to download
GuestBook | Open | Open | Fill in guestbook before download
Terms of Use | Open | Open | Click through terms of use before download
Data Restricted | Open | Restricted | Request access via click through
Data Restricted | Open | Restricted | Request access via application
Data Publishing Workflows
• Create Dataset (landing page restricted)
• Review (collaborators or anonymous reviewers)
• Publish v. 1
• Minor change (metadata only)
• Publish v. 1.1
• Major change (might include new data file)
• Publish v. 2
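The version sequence in this workflow (v. 1, then v. 1.1 after a minor change, then v. 2 after a major change) can be sketched as a small function. The name `next_version` and the tuple representation are hypothetical; how Dataverse decides whether a change is minor or major is not specified here beyond what the slide says, so the caller supplies that decision.

```python
from typing import Tuple

Version = Tuple[int, int]  # (major, minor); (1, 0) stands for "v. 1"


def next_version(current: Version, major_change: bool) -> Version:
    """Return the version number to publish after a change."""
    major, minor = current
    if major_change:
        # e.g. a new data file was added
        return (major + 1, 0)
    # metadata-only edits bump the minor number
    return (major, minor + 1)


v1 = (1, 0)
v1_1 = next_version(v1, major_change=False)
v2 = next_version(v1_1, major_change=True)
print(v1_1, v2)  # (1, 1) (2, 0)
```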
And more at dataverse.org guides ...
Biomedical Dataverse addresses data
publication of large files: SBGridData
The Biomedical Dataverse at Harvard Medical School,
also tested as a persistent repository for LINCS data
(NIH Library of Integrated Network-based Cellular Signatures)
Collaboration with Piotr Sliz and Caroline Shamu (HMS)
An additional challenge
for data publishing:
Sensitive Data
“User Uploads must be void of all identifiable
information, such that re-identification of any subjects
from the amalgamation of the information available
from all of the materials (across datasets and
dataverses) uploaded under any one author and/or
user should not be possible.”
“Submitter represents and warrants that the Content
does not contain any information (i) which identifies, or
which can be used in conjunction with other publicly
available information to personally identify, any
individual;”
“If you are submitting human sequences to GenBank,
do not include any data that could reveal the personal
identity of the source. It is our assumption that you
have received any necessary informed consent
authorizations that your organizations require prior to
submitting your sequences.”
GenBank
How can we maximize
publishing sensitive data while
being mindful of privacy?
Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The DataTags System. Technology Science. 2015101601.
October 16, 2015. http://techscience.org/a/2015101601
The DataTags System
A datatag is a set of security features and access
requirements for file handling
A datatags repository is one that stores and shares
data files in accordance with standardized and
ordered levels of security and access requirements
Datatags Levels
Tag Type | Description | Security Features | Access Requirements
Blue | Public | Clear storage; clear transmission | Open
Green | Controlled public | Clear storage; clear transmission | Email, OAuth verified registration
Yellow | Accountable | Clear storage; encrypted transmit | Password, Registered, Approval, Click DUA
Orange | More accountable | Encrypted storage; encrypted transmit | Password, Registered, Approval, Signed DUA
Red | Fully accountable | Encrypted storage; encrypted transmit | Two-factor authentication, Approval, Signed DUA
Crimson | Maximally restricted | Multi-encrypt store; encrypted transmit | Two-factor authentication, Approval, Signed DUA
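Because the tag levels are ordered, the levels table can be sketched as an ordered enumeration with threshold checks. This is an illustrative encoding only: the class name `DataTag`, the numeric values, and the helper function names are assumptions, not part of the DataTags system; the thresholds are read directly off the table.

```python
from enum import IntEnum


class DataTag(IntEnum):
    """Tag levels from the table, ordered from least to most restrictive."""
    BLUE = 1
    GREEN = 2
    YELLOW = 3
    ORANGE = 4
    RED = 5
    CRIMSON = 6


def requires_encrypted_transmission(tag: DataTag) -> bool:
    # Per the table, transmission is encrypted from Yellow upward.
    return tag >= DataTag.YELLOW


def requires_encrypted_storage(tag: DataTag) -> bool:
    # Per the table, storage is encrypted from Orange upward.
    return tag >= DataTag.ORANGE


def requires_two_factor(tag: DataTag) -> bool:
    # Per the table, two-factor authentication applies to Red and Crimson.
    return tag >= DataTag.RED


for tag in DataTag:
    print(tag.name,
          requires_encrypted_transmission(tag),
          requires_encrypted_storage(tag),
          requires_two_factor(tag))
```

Encoding the levels as ordered integers means a repository can compare a file's tag against the level it is able to support with a single comparison.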
DataTags Workflow in a Dataverse Repository
(under development)
[Workflow diagram labels:]
• Data File Ingestion
• Sensitive Dataset
• Direct Access
• Privacy Preserving Access
• Automatic Interview
• Review Board Approval
• Two-factor Authentication; Signed DUA
http://datatags.org
http://privacytools.seas.harvard.edu
Example of DataTags Interview
Thanks!
And join us at this year’s
Dataverse Community Meeting
References
• http://dataverse.org
• http://dataverse.harvard.edu
• http://datatags.org
• Sweeney L, Crosas M, Bar-Sinai M, 2015, Sharing
Sensitive Data with Confidence: The DataTags System.
Technology Science, http://techscience.org/a/2015101601
• Gross, Harmon, Reidy, 2001, Communicating Science
• Mabe, 2003, The Growth and Number of Journals
• Friendly, 2006, A Brief History of Data Visualization

Data Publishing at Harvard's Research Data Access Symposium
Dataverse hpdm symposium
Collaboration in science and technology it summit
Dataverse for Journals
Collaboration in science and technology
Force11 jddcp intro
Ad

Recently uploaded (20)

PDF
Session 11 - Data Visualization Storytelling (2).pdf
PPTX
New ISO 27001_2022 standard and the changes
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PPTX
IMPACT OF LANDSLIDE.....................
PPTX
Tapan_20220802057_Researchinternship_final_stage.pptx
PDF
A biomechanical Functional analysis of the masitary muscles in man
PPTX
MBA JAPAN: 2025 the University of Waseda
PPTX
Machine Learning and working of machine Learning
PPTX
1 hour to get there before the game is done so you don’t need a car seat for ...
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
PDF
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
PPTX
statsppt this is statistics ppt for giving knowledge about this topic
PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
©️ 02_SKU Automatic SW Robotics for Microsoft PC.pdf
PDF
An essential collection of rules designed to help businesses manage and reduc...
PPT
PROJECT CYCLE MANAGEMENT FRAMEWORK (PCM).ppt
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PDF
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
PPTX
chrmotography.pptx food anaylysis techni
Session 11 - Data Visualization Storytelling (2).pdf
New ISO 27001_2022 standard and the changes
retention in jsjsksksksnbsndjddjdnFPD.pptx
IMPACT OF LANDSLIDE.....................
Tapan_20220802057_Researchinternship_final_stage.pptx
A biomechanical Functional analysis of the masitary muscles in man
MBA JAPAN: 2025 the University of Waseda
Machine Learning and working of machine Learning
1 hour to get there before the game is done so you don’t need a car seat for ...
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
statsppt this is statistics ppt for giving knowledge about this topic
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
©️ 02_SKU Automatic SW Robotics for Microsoft PC.pdf
An essential collection of rules designed to help businesses manage and reduc...
PROJECT CYCLE MANAGEMENT FRAMEWORK (PCM).ppt
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
chrmotography.pptx food anaylysis techni

The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

  • 1. The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags help). Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, Institute for Quantitative Social Science, Harvard University. @mercecrosas. NDSR 2016 Symposium
  • 2. From 1665 to late 20th century: A steady increase in size and complexity of research output
  • 7. The number of journals doubles every 20 years since the 1750s, with the growth in the number of scientists (Mabe, 2003). 1700: 3 journals; 1800: ~10 journals; 1900: ~400 journals; 2000: ~14,000 journals (peer-reviewed). [Chart: number of journals (axis marked 100 and 10,000) against year (1665 to 1965)]
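The doubling rule on this slide can be checked against the journal counts it quotes. The sketch below is only an illustration: the function name and its parameters are hypothetical, and it assumes clean exponential growth with a fixed 20-year doubling time.

```python
def projected_journals(start_count, start_year, year, doubling_years=20):
    """Project a journal count forward, assuming it doubles every `doubling_years` years."""
    periods = (year - start_year) / doubling_years
    return start_count * 2 ** periods


# Starting from the slide's ~10 journals in 1800 and ~400 journals in 1900.
by_1900 = projected_journals(10, 1800, 1900)
by_2000 = projected_journals(400, 1900, 2000)
print(by_1900, by_2000)
```

Five doublings per century give 320 and 12,800, the same order of magnitude as the ~400 and ~14,000 journals quoted on the slide.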
  • 17. Data Tables and Visuals Become Increasingly Common, and Part of the Scientific Argument [Timeline chart, 1665 to 1965]
      • Tables and figures: a few tables & visuals, as part of the text; then 50% of articles have tables & figures; then most articles have tables & figures, often standalone
      • Citations: 50% cite previous work; then 100% with citations (1 per 100 words), part of scholarly credit
      • Method sections appear
      • First line graphs and bar charts (Playfair, 1786)
      • First scatterplots (Herschel, 1833; Galton, 1896)
  • 30. Scholarly Publishing Adapts to the Increase of Cognitive Complexity (Gross et al., 2001)
      • 18th century:
          • formal components appear in articles (introduction, conclusions, tables, figures, citations)
      • 19th century:
          • explain data instead of establishing observations of facts
          • wide use of visuals, high citation density, methods section
      • 20th century:
          • structured quantitative data with increased use of statistics
          • wide range of data types with new technologies
      • Number of scientists increases from hundreds to a few million
      • Science becomes extremely specialized:
          • from 1 journal to 14,000 peer-reviewed journals
          • one new journal for every 150 authors, read by 500
  • 31. In the last decades, more and more publications and data
  • 32. A Steeper Growth of Scholarly Output: since 1950, the total number of journals doubles every ~15 years. 2010: 80,000 journals; 2010: 33,000 peer-reviewed
  • 40. An Outburst of Research Data and Specialization Results in > 1000 Community Repositories
      • 1920 - 1950s: First Social Science Data Archives (ODUM, ICPSR, ...)
      • 1970 - 1980s: First Biomedical Databases (PDB, GenBank, ...)
      • 2016: A wide range of Research Data Repositories; 1500 repositories listed in re3data.org
  • 49. Data Publishing Emerges as the Union of Scholarly Publishing and Data Archiving
      • Scholarly publishing (distribute research output): attribution and credit; dissemination; finding & reuse
      • Data archiving (long-term access to data): accessibility; preservation; finding & reuse
  • 55. Why Data Publishing now? Extending Gross et al.'s thesis, data publishing accommodates the complexity of research input and output in the digital world.
      • Data (and software) have become common inputs and outputs of research
      • A scholarly article cannot hold or accurately describe these vast amounts of data and software
      • As input and output of research, data must be citable and accessible to enable validation and reuse, with attribution
  • 67. What is needed for FAIR Data Publishing (FAIR = Findable, Accessible, Interoperable, Reusable)
      • Data Citation:
          • Persistent id to reference data uniquely
          • Support for versions and fixity
          • Attribution to authors and repository
      • Metadata:
          • Catalog to discover and locate the data
          • Sufficient information to understand and reuse the data
      • Repository:
          • Digital access to metadata and data
          • Archive and preservation for long-term access
          • Interoperability through standards and APIs
  • 69. A data repository system that serves as a solution for publishing FAIR research data
  • 71. Around the World: Dataverse repositories serve a community, an institution, an archive, ... Harvard Dataverse: generic data repository open to researchers worldwide
  • 72. Dataverses contain datasets, datasets contain metadata and data files
  • 74. Data Citation in Dataverse. Citation components: Authors; Published Year; Dataset Title; Global Persistent Identifier; Repository (= Data Publisher); Version (or time range)
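The citation components listed on this slide can be assembled into a single citation string. The following is a minimal sketch, not Dataverse's own code: the class name, the field order, the separators, and the example identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """Hypothetical container for the citation components named on the slide."""
    authors: list[str]
    year: int
    title: str
    persistent_id: str   # global persistent identifier, e.g. a DOI
    repository: str      # the repository acts as the data publisher
    version: str         # version label (or a time range)

    def format(self) -> str:
        # The ordering and punctuation here are an assumed style.
        names = "; ".join(self.authors)
        return (f'{names}, {self.year}, "{self.title}", '
                f'{self.persistent_id}, {self.repository}, {self.version}')


# Every value below is made up for the example.
citation = DataCitation(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2016,
    title="Example Survey Data",
    persistent_id="doi:10.5072/FK2/EXAMPLE",
    repository="Example Dataverse",
    version="V1",
)
print(citation.format())
```

Keeping the persistent identifier and the version as separate fields lets a citation point at one exact state of a dataset while the identifier itself stays stable.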
  • 76. Data Citation Basics (Force11, Joint Declaration of Data Citation Principles; Starr et al., 2015). The dataset landing page is accessible and guaranteed by the repository (or data publisher), even when data are restricted or deaccessioned
  • 81. Metadata in Dataverse (metadata level: fields; standards)
      • Citation Metadata: author, title, repository, year published, version, etc.; standards: Dublin Core, DataCite
      • Domain-specific Metadata: data collection info (methods, organism, observation, survey, experiment, etc.); standards: DDI (social sciences), ISA-Tab, BioCaddie (biomed), Virtual Observatory (astro), + custom metadata blocks
      • File-level Metadata: metadata inside the data file (variables, instrument details, geospatial info, etc.); standards: DDI (for variables), + more to be determined
      • Dataverse JSON Schema
  • 86. Information Extraction: Tabular Files
      • File formats: RData, Stata, SPSS, Excel, CSV
      • Example table (columns var 1, var 2, var 3): obs 1: 2, a, 0; obs 2: 4, c, 0; obs 3: 6, b, 1; obs 4: 1, e, 0; obs 5: 2, a, 1; obs 6: 3, b, 1
      • Variable Metadata: variable name, label, type, stats, geospatial coordinates
      • Data Values: independent of format
      • Universal Numerical Fingerprint (UNF): checksum on data values, from canonical format
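The idea of a checksum computed on data values in a canonical form, rather than on the bytes of a file, can be illustrated as follows. This is a simplified sketch and not the UNF algorithm: the normalization rule, the use of SHA-256, and the function names are assumptions chosen for the example.

```python
import hashlib


def canonical(value) -> str:
    """Render one cell in a format-independent way (illustrative rule, not the UNF specification)."""
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        # Exponent notation with seven significant digits, so 2 and 2.0 agree.
        return format(float(value), ".6e")
    return str(value)


def fingerprint(columns) -> str:
    """Hash the canonical form of every value, column by column."""
    digest = hashlib.sha256()
    for column in columns:
        for value in column:
            digest.update(canonical(value).encode("utf-8"))
            digest.update(b"\n")  # terminator so adjacent values cannot run together
    return digest.hexdigest()


# The slide's example table, as it might be read from two different file formats.
from_csv = [[2, 4, 6, 1, 2, 3], ["a", "c", "b", "e", "a", "b"], [0, 0, 1, 0, 1, 1]]
from_stata = [[2.0, 4.0, 6.0, 1.0, 2.0, 3.0], ["a", "c", "b", "e", "a", "b"], [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]]
print(fingerprint(from_csv) == fingerprint(from_stata))
```

Because the hash depends only on the values, re-saving the same table in another format leaves the fingerprint unchanged, while changing any single value alters it.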
  • 90. Information Extraction: FITS (astro) Files
      • Header Metadata: coordinates (R.A., declination), photometric info, ...
      • Data Objects: image files, spectra, data cubes, tables, ...
  • 91. In addition to data citation and metadata features, Dataverse has a rich set of features that facilitate data publishing
  • 93. Tiered Access (access level: metadata; files; how to access)
      • Open (default), CC0: metadata open; files open; click to download
      • GuestBook: metadata open; files open; fill in guestbook before download
      • Terms of Use: metadata open; files open; click through terms of use before download
      • Data Restricted: metadata open; files restricted; request access via click-through
      • Data Restricted: metadata open; files restricted; request access via application
  • 107. Data Publishing Workflows: Create Dataset (landing page restricted); Review (collaborators or anonymous reviewers); Publish v. 1; Minor change (metadata only); Publish v. 1.1; Major change (might include new data file); Publish v. 2
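The version numbering in this workflow (v. 1, then v. 1.1 after a metadata-only change, then v. 2 after a major change) can be sketched as a small rule. The names below are hypothetical, and the sketch assumes that any change to data files counts as major, which the slide only suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical major.minor version label for a published dataset."""
    major: int
    minor: int = 0

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}"


def next_version(current: Version, files_changed: bool) -> Version:
    """Assumed rule: file changes bump the major number, metadata-only edits bump the minor."""
    if files_changed:
        return Version(current.major + 1, 0)
    return Version(current.major, current.minor + 1)


v1 = Version(1)
v1_1 = next_version(v1, files_changed=False)   # minor change (metadata only)
v2 = next_version(v1_1, files_changed=True)    # major change (new data file)
print(v1, v1_1, v2)
```

Resetting the minor number on a major change mirrors the slide's sequence, where the release after v. 1.1 is v. 2.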
  • 108. And more at dataverse.org guides ...
  • 109. Biomedical Dataverse addresses data publication of large files: SBGridData
  • 110. The Biomedical Dataverse at Harvard Medical School, also tested as a persistent repository for LINCS data (NIH Library of Integrated Network-based Cellular Signatures). Collaboration with Piotr Sliz and Caroline Shamu (HMS)
  • 112. An additional challenge for data publishing: Sensitive Data
  • 113. “User Uploads must be void of all identifiable information, such that re-identification of any subjects from the amalgamation of the information available from all of the materials (across datasets and dataverses) uploaded under any one author and/or user should not be possible.”
  • 114. “Submitter represents and warrants that the Content does not contain any information (i) which identifies, or which can be used in conjunction with other publicly available information to personally identify, any individual;”
  • 115. “If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. It is our assumption that you have received any necessary informed consent authorizations that your organizations require prior to submitting your sequences.” GenBank
  • 116. How can we maximize publishing sensitive data while being mindful of privacy?
  • 117. The DataTags System. Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The DataTags System. Technology Science. 2015101601. October 16, 2015. http://techscience.org/a/2015101601
  • 120. A datatag is a set of security features and access requirements for file handling. A datatags repository is one that stores and shares data files in accordance with a standardized and ordered set of levels of security and access requirements
  • 121. DataTags Levels (tag type: description; security features; access requirements)
      • Blue: Public; clear storage, clear transmission; open
      • Green: Controlled public; clear storage, clear transmission; email, OAuth verified registration
      • Yellow: Accountable; clear storage, encrypted transmission; password, registered, approval, click DUA
      • Orange: More accountable; encrypted storage, encrypted transmission; password, registered, approval, signed DUA
      • Red: Fully accountable; encrypted storage, encrypted transmission; two-factor authentication, approval, signed DUA
      • Crimson: Maximally restricted; multi-encrypted storage, encrypted transmission; two-factor authentication, approval, signed DUA
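Because the tag levels are ordered, the security features in the table can be derived from a tag's position in that order. The sketch below is an illustration only and is not the DataTags implementation: the enum, the threshold checks, and the `can_host` helper are hypothetical names that encode the rows of the table above.

```python
from enum import IntEnum


class DataTag(IntEnum):
    """The six tag levels from the slide, ordered from least to most restrictive."""
    BLUE = 1
    GREEN = 2
    YELLOW = 3
    ORANGE = 4
    RED = 5
    CRIMSON = 6


def security_features(tag: DataTag) -> dict:
    """Derive the table's security features from the tag's position in the ordering."""
    return {
        "encrypted_transmission": tag >= DataTag.YELLOW,
        "encrypted_storage": tag >= DataTag.ORANGE,
        "two_factor_authentication": tag >= DataTag.RED,
    }


def can_host(repository_max: DataTag, file_tag: DataTag) -> bool:
    """Assumed rule: a repository may hold a file only up to the highest level it supports."""
    return file_tag <= repository_max


print(security_features(DataTag.YELLOW))
print(can_host(DataTag.YELLOW, DataTag.RED))
```

Modelling the tags as an ordered enumeration means a single comparison answers whether a repository's controls are strict enough for a given file.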
  • 122. DataTags Workflow in a Dataverse Repository (under development). [Diagram labels: Data File Ingestion; Sensitive Dataset; Automatic Interview; Review Board Approval; Direct Access; Privacy Preserving Access; Two-factor Authentication; Signed DUA] http://datatags.org http://privacytools.seas.harvard.edu
  • 123. Example of DataTags Interview
  • 129. Thanks! And join us at this year’s Dataverse Community Meeting
  • 130. References
      • http://dataverse.org
      • http://dataverse.harvard.edu
      • http://datatags.org
      • Sweeney L, Crosas M, Bar-Sinai M, 2015, Sharing Sensitive Data with Confidence: The DataTags System. Technology Science, http://techscience.org/a/2015101601
      • Gross, Harmon, Reidy, 2001, Communicating Science
      • Mabe, 2003, The Growth and Number of Journals
      • Friendly, 2006, A Brief History of Data Visualization