Data Publishing Workflows
with Dataverse
Mercè Crosas, Ph.D.
Twitter: @mercecrosas
Director of Data Science
Institute for Quantitative Social Science, Harvard University
MIT, May 6, 2014
Intro to our Data Science
Team and Projects
Data Science at the Institute for Quantitative Social Science
http://datascience.iq.harvard.edu
Combines Expertise
Data Science
Applications
and Tools
Researchers
Software
Engineers
Information
Scientists
Statistical
Innovation
Tool Building &
Computer Science
Data Curation &
Stewardship
With a Team of 20
Mercè Crosas,
Director of Data Science
Statistics and Analytics
James Honaker
Christine Choirat
Vito d’Orazio
Software Development
Gustavo Durand
Robert Treacy
Ellen Kraffmiller
Michael Bar-Sinai
Leonid Andreev
Phil Durbin
Steve Kraffmiller
Xiangqing Yang
Raman Prasad (BARI)
Data Curation and Archiving
Sonia Barbosa
Eleni Castro
Dwayne Liburd
QA
Kevin Condon
Elda Sotiri
Usability and UI
Elizabeth Quigley
Michael Heppler
Gary King,
Director of IQSS
Cris Rothfuss,
Executive Director
Two Widely Used Frameworks
Developed in the Last Decade
Zelig: A framework that allows analysts to use and interpret
a large body of R statistical models from
heterogeneous contributors through a common
interface.
Dataverse: A data publishing framework that allows researchers
to share, preserve, cite and analyze data, while
keeping control and gaining credit for their data.
New Tools that Integrate with our Initial Work
SolaFide: An interactive web interface that allows users at all
levels of statistical expertise to explore their data and
appropriately construct statistical models.
Integrates with Zelig and Dataverse.
DataTags: A framework that allows data contributors to set a
level of sensitivity for their dataset based on legal
regulations, which defines how the data can be
stored and shared.
Integrates with Dataverse.
In collaboration with the NSF Privacy Tools project.
Expanding in other Areas
Consilience: A web application that helps researchers discover
new clusters to categorize large document sets,
leveraging all the clustering methods in the literature.
An application that provides a continuous
integration build solution for R packages, from code shared in
Git to archived, published code in CRAN.
Support Throughout the Research Cycle
Develop
Quantitative
Methods
Analyze
Quantitative
Datasets
Analyze
Unstructured
Text
Publish Data
Cite Data from
Published Results
Explore,
reanalyze and
reuse data
Share Sensitive Data
Develop > Analyze > Share > Explore > Validate & Reuse
Current Research Interests and Efforts
Reproducible and Reusable Science: “encourage open data and
methodological transparency, and promote and enable data
citation” (with Dataverse, Zelig and SolaFide)
Computationally Assisted Exploration: “with Consilience and SolaFide,
assist researchers to understand and discover new insights from their
data”
Interdisciplinary Quantitative Scientific Scope: “our tools and research
frameworks address broad methodological issues in quantitative
science and are often employed in other domains”
When Data are Not Open: “solutions to preserve privacy, while still
providing science the fundamental ability to learn, access and
replicate findings, with DataTags and PrivateZelig”
Large-Scale Data Sets: “will handle large-scale data sets, as Big Data
science reaches all disciplines: Consilience for millions of text
documents, and Zelig and Dataverse to handle TB-PB-scale data sets.”
Harvard Dataverse
The Harvard Dataverse Repository
• In collaboration with the Harvard Library, Harvard hosts a
Dataverse instance free and open to all researchers.
• It currently holds more than 53,000 datasets, with 735,000 files.
• Find or deposit data at: http://thedata.harvard.edu
Collaborations with MIT
• Membership through the Harvard-MIT Data Center (e.g.,
statistics training, access to ICPSR collection)
• The MIT Libraries Dataverse disseminates data purchased
by the MIT Libraries (with Kate McNeill):
  • http://thedata.harvard.edu/dvn/dv/mit
• MIT faculty and research groups are already
disseminating their data through the Harvard Dataverse
• Research collaborations (with Micah Altman):
  • Integration of Publications with Data (Funded by Sloan):
  http://projects.iq.harvard.edu/ojs-dvn
  • Privacy Tools for Sharing Research Data (Funded by NSF):
  http://privacytools.seas.harvard.edu/
Dataverse 4.0
Target release date: June 23
• New UI
• New rich, faceted search
• New data file ingest
(Excel, CSV, R, Stata,
SPSS)
• New metadata for social
sciences, astronomy,
biomedical sciences.
• Integration with SolaFide.
SolaFide Demo
Data Publishing Workflows
Data Publishing Guidelines
Three pillars of Data Publishing:
• A trusted data repository to guarantee long-term access
• A formal data citation*
• Sufficient information to understand and reuse the data
(metadata, documentation, code)
* Data Citation Principles: https://www.force11.org/datacitation
A Rigorous Publishing Workflow
Release Version 1
A published dataset
cannot be deleted
(only deaccessioned, if
legally needed)
Push Version 1.1: small metadata
change; citation doesn’t change
Push Version 2: big metadata
change, or file change; citation
changes
Authors, Title, Year, DOI Repository, UNF, V1
Authors, Title, Year, DOI Repository, UNF, V2
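The versioning rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (`DatasetVersion`, `next_version`, `format_citation`) are hypothetical and not part of the Dataverse software, the DOI and repository are treated as separate citation fields, and the UNF value is a placeholder rather than a computed fingerprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    major: int
    minor: int = 0

    def citation_label(self) -> str:
        # Only the major number appears in the citation (V1, V2, ...),
        # so a 1.0 -> 1.1 bump leaves the citation unchanged.
        return f"V{self.major}"


def next_version(current, files_changed=False, big_metadata_change=False):
    """Return the version a new release would get (hypothetical helper)."""
    if files_changed or big_metadata_change:
        # Big metadata change or file change: new major version.
        return DatasetVersion(current.major + 1, 0)
    # Small metadata change: minor version only.
    return DatasetVersion(current.major, current.minor + 1)


def format_citation(authors, title, year, doi, repository, unf, version):
    # Field order follows the slide: Authors, Title, Year, DOI, Repository, UNF, Vn.
    parts = [authors, title, str(year), doi, repository, unf,
             version.citation_label()]
    return ", ".join(parts)


v1 = DatasetVersion(1)
v1_1 = next_version(v1)                      # small metadata change
v2 = next_version(v1_1, files_changed=True)  # file change
```

With these definitions, the citations for `v1` and `v1_1` are identical, while the citation for `v2` ends in `V2`.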
Workflows that Integrate with Journals
1. Publish a dataset to your Dataverse, then provide the
Data Citation to the journal.
2. Contribute to a journal Dataverse:
   a. Add the dataset to the journal Dataverse as a draft.
   b. The journal editor reviews it and approves it for release.
   c. The dataset is published with a Data Citation and a link from the journal
   article to the data.
3. Seamless integration between the journal system and
Dataverse.
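The second workflow (draft, editor review, release) behaves like a small state machine. The sketch below is illustrative only: `JournalDataset`, its states, and the example URL are hypothetical names chosen here, not the Dataverse data model.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    PUBLISHED = "published"


class JournalDataset:
    """A dataset contributed to a journal Dataverse (hypothetical model)."""

    def __init__(self, title):
        self.title = title
        self.state = State.DRAFT      # step a: added as a draft
        self.article_url = None

    def approve(self):
        # Step b: the journal editor reviews and approves the draft.
        if self.state is not State.DRAFT:
            raise ValueError("only a draft can be approved")
        self.state = State.APPROVED

    def publish(self, article_url):
        # Step c: release the dataset and record the link to the article.
        if self.state is not State.APPROVED:
            raise ValueError("the journal editor must approve the draft first")
        self.article_url = article_url
        self.state = State.PUBLISHED
```

Calling `publish` on a draft raises an error, which mirrors the requirement that the editor approves the dataset before release.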
OJS and Dataverse Integration
• Sloan-funded project to integrate PKP’s Open Journal
Systems (OJS) with the Dataverse software.
• Pilot with ~50 journals
• OJS Dataverse plugin now available with the latest OJS
release
• http://projects.iq.harvard.edu/ojs-dvn
Detailed System Integration
Client to repository (request):
• XML file: AtomPub "entry" with Dublin Core Terms (e.g., title, creator)
• Zip file: all data files associated with that dataset.
• HTTP header "In-Progress: false" to publish datasets.
• Supported HTTP verbs: GET, PUT, POST, and DELETE.
Repository to client (response):
• XML file: “Deposit Receipt”
• HTTP status codes: 200, 201, 204, 404, 405, 406, 412, 415
The client can query the repository (server) at any time to get the status.
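As a rough illustration of the two pieces named above, the Python sketch below builds a minimal Atom "entry" carrying Dublin Core terms and turns the listed status codes into readable descriptions. It uses only the standard library. The helper names (`build_entry`, `describe_status`) and the sample values are assumptions made here, and the entry is far smaller than what a real deposit would contain.

```python
import xml.etree.ElementTree as ET
from http import HTTPStatus

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_entry(title, creators):
    """Serialize an Atom entry with dcterms:title and dcterms:creator."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}title").text = title
    for creator in creators:
        ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    return ET.tostring(entry, encoding="unicode")


# Status codes listed on the slide.
SLIDE_STATUS_CODES = (200, 201, 204, 404, 405, 406, 412, 415)


def describe_status(code):
    """Return the standard HTTP reason phrase plus a coarse category."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"


entry_xml = build_entry("Example Dataset", ["Doe, Jane", "Roe, Richard"])
descriptions = [describe_status(code) for code in SLIDE_STATUS_CODES]
```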
Deposit API based on SWORD
• Follows the SWORD2 specification
• SWORD is known and supported within academic
publishing; it is a “profile” of the AtomPub standard.
• The SWORD project provides client libraries for Python,
Java, Ruby, and PHP:
  • OJS uses the PHP client library
  • OSF uses the Python client library
  • DataUp and DVN-R built custom Dataverse clients
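For readers who want to see what a custom client has to assemble, here is a hedged sketch of a zip deposit request built with Python's standard library. The request is constructed but never sent. The URL is a made-up placeholder, `build_zip_deposit` is a hypothetical helper, and the header set (Content-Disposition, Packaging, In-Progress) follows a reading of the SWORD 2.0 profile rather than anything stated in these slides, so check it against the specification and the target repository before relying on it.

```python
import urllib.request

# Packaging identifier for a plain zip of files, as named in the SWORD 2.0 profile.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_zip_deposit(edit_media_url, zip_bytes, filename, publish=False):
    """Build (but do not send) a POST request that uploads a zip of data files."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": SIMPLE_ZIP,
        # "false" signals that the deposit is complete.
        "In-Progress": "false" if publish else "true",
    }
    return urllib.request.Request(
        edit_media_url, data=zip_bytes, headers=headers, method="POST"
    )


# Placeholder URL and payload, for illustration only.
request = build_zip_deposit(
    "https://repository.example/sword/edit-media/study/123",
    b"not-a-real-zip",
    "data.zip",
)
# urllib stores header names capitalized, e.g. "In-progress".
in_progress = request.get_header("In-progress")
```

Sending the request (for example with `urllib.request.urlopen`) would also require authentication, which is left out here.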
How it differs from SWORD
• Dataverse does not use the SWORD download API:
  • Uses its own Data API
  • Plan to add this support in the future
• Adds an XML attribute to pass the article citation from the client:
  • Allows dcterms:isReferencedBy to contain attributes such as
  holdingsURI to link back to the article from Dataverse
  • This is now part of the SWORD PHP client library
• Uses “In-Progress: false” to indicate that the dataset is ready
to be published (in the SWORD spec it means the deposit is
complete)
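The article back-link described above might look like this when built with Python's standard XML library. This is a sketch under assumptions: the attribute is written here as `holdingsURI` (the slide spells it "HoldingsURI"), and the function name, citation text, and URL are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def entry_with_article_link(title, article_citation, article_url):
    """Build an Atom entry whose isReferencedBy term links back to the article."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}title").text = title
    ref = ET.SubElement(entry, f"{{{DCTERMS_NS}}}isReferencedBy")
    ref.text = article_citation          # human-readable article citation
    ref.set("holdingsURI", article_url)  # attribute spelling is an assumption
    return entry


entry = entry_with_article_link(
    "Example Dataset",
    "Doe, J. (2014). An Example Article. Example Journal, 1(1).",
    "https://journal.example/article/1",
)
reference = entry.find(f"{{{DCTERMS_NS}}}isReferencedBy")
```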
Support for Metadata Standards
• Core (citation) metadata that applies to all datasets;
currently supported by the Data Deposit API
• Extensible metadata blocks for specific domains:
  • Social sciences:
    • Maps to the DDI schema;
    • file metadata extracted from tabular data files
  • Astronomy:
    • Maps to the VO schema;
    • partially extracted from FITS files
  • Biomedical sciences:
    • Maps to the ISA-Tab schema
    • Controlled vocabularies map to EFO, OBI, and the Ontology of
    Clinical Research
    • Extended and managed using SKOS (supports taxonomies
    within the framework of the semantic web)
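One way to picture "core metadata plus extensible domain blocks" is the small Python data structure below. It is a conceptual sketch only: the class names, field names, and sample values are hypothetical and do not reflect how Dataverse stores metadata internally. The domain-to-schema mapping is taken from the list above.

```python
from dataclasses import dataclass, field

# Domain-specific blocks and the schema each one maps to (from the slide).
DOMAIN_SCHEMAS = {
    "social sciences": "DDI",
    "astronomy": "VO",
    "biomedical sciences": "ISA-Tab",
}


@dataclass
class MetadataBlock:
    domain: str
    fields: dict = field(default_factory=dict)

    @property
    def schema(self):
        return DOMAIN_SCHEMAS[self.domain]


@dataclass
class DatasetMetadata:
    citation: dict                              # core metadata, always present
    blocks: list = field(default_factory=list)  # optional domain blocks

    def add_block(self, domain, **fields):
        if domain not in DOMAIN_SCHEMAS:
            raise KeyError(f"no metadata block defined for {domain!r}")
        block = MetadataBlock(domain, dict(fields))
        self.blocks.append(block)
        return block


metadata = DatasetMetadata(citation={"title": "Example Dataset", "author": "Doe, Jane"})
metadata.add_block("astronomy", instrument="example telescope")
schemas = [block.schema for block in metadata.blocks]
```

Every dataset carries the `citation` dictionary, while domain blocks are added only when they apply, which is the extensibility the slide describes.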
Upcoming
Expanding to Larger and More Types
of Data
• Sharing sensitive data with DataTags and Secure
Dataverse
• Integration with other systems:
  • OSF
  • DataUp
  • WorldMap
  • DataBridge
  • ORCID
  • DASH (at Harvard)
• Expand to larger data sets
DataTags: For Sharing Sensitive Data
THANKS
mcrosas@iq.harvard.edu Twitter: @mercecrosas
http://datascience.iq.harvard.edu (Beta)
