Data Sharing & Data Citation
Micah Altman, Institute for Quantitative Social Science, Harvard University
Prepared for: data coding, analysis, archiving, and sharing for open collaboration, NSF, Sept 15-16, 2011
Collaborators*
Margaret Adams, George Alter, Leonid Andreev, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Tom Carsey, Kevin Condon, Jonathan Crabtree, Merce Crosas, Darrell Donakowski, Myron Gutmann, Gary King, Patrick King, Tom Lipkis, Freeman Lo, Jared Lyle, Marc Maynard, Nancy McGovern, Amy Pienta, Lois Timms-Ferrara, Akio Sone, Bob Treacy, Copeland Young
* And co-conspirators
Research Support
Thanks to the Library of Congress (PA#NDP03-1), the National Science Foundation (DMS-0835500, SES 0112072), IMLS (LG-05-09-0041-09), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
Related Work
- Altman, M., and J. Crabtree. 2011. “Using the SafeArchive System: TRAC-Based Auditing of LOCKSS”, Proceedings of Archiving 2011.
- Crosas, M. 2011. “The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data”, D-Lib Magazine 17(1/2).
- Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. “Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences”, The American Archivist 72(1): 169-182.
- Gutmann, M., Abrahamson, M., Adams, M.O., Altman, M., Arms, C., Bollen, K., Carlson, M., Crabtree, J., Donakowski, D., King, G., Lyle, J., Maynard, M., Pienta, A., Rockwell, R., Timms-Ferrara, L., Young, C. 2009. “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3): 315-33.
- Altman, M. 2008. “A Fingerprint Method for Verification of Scientific Data”, in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag.
- Altman, M., and G. King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4) (March/April).
- King, G. 2007. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, Vol. 32, No. 2, pp. 173-199.
Motivations
Access to Data is the Foundation of Science
- Science is not (only) about being scientific
  - Scientific progress requires community: competition and collaboration in the pursuit of common goals
  - Without access to the same materials, no community exists
  - … data is the nucleus of scientific collaboration
- The value of an article that can’t be replicated: ?
  - Scholarly articles are summaries, not the actual research results
  - Experimental data are expensive to reproduce; observational data are impossible to reproduce
  - Hard for journal editors to verify: if you find it, how do you know it’s the same?
  - Replication projects show that many published articles cannot be replicated
  - … data is needed for scientific replication
Sources: Fienberg et al. 1985; ICSU 2004; Nature 2009
Open Data Broadens & Deepens Impact
- Data-Intensive Science
  - Increased opportunities for interdisciplinarity
  - Science modeling across multiple scales
  - Continuous, complete, fine-grained information on physical processes, systems, human behavior
- Education
  - Data eases the transition from education to research
- Open Data Democratizes Science
  - Citizen-scientists
  - Developing countries
  - Researchers outside the inner circle of an institution
  - Crowd-sourcing, open notebooks, and mashups
- & Data Sharing Increases Publication Impact [Gleditsch 2003; Wilson 2008; Piwowar 2007]
Data is Key to Government
- Statistics = state-istics
- Reformers use data
  - To assess the performance of the state
  - To assess social conditions
- Governments attempt to control access to data to evade accountability
- Policy debates often center on data
  - War on poverty, civil rights, consumer protection: all made heavy use of statistical arguments
  - Economic and environmental policies are data-intensive
- Data access brings together both sides of the political spectrum
  - In a modern democracy the public needs a direct source of information
  - Liberals and conservatives support access to data informing policy
Image source: “Propaganda”, http://www.media-studies.ca/articles/images/berlin_wall.jpg
Sources: Gough 2003; Shulman 2006; Wagner & Steinzor 2006; Alonzo and Starr 1988
Open Data is “Research Insurance”
- Keeps options open after the nominal end of the project; extends the lifecycle
  - Continuation projects
  - Publication revisions
  - Broader research programs
- Insures against loss of “project memory”
  - Departure of senior personnel from the institution
  - Departure of post-docs and graduate students
  - Accidental loss of data due to local IT failures
  - Reduces questions from secondary analysts
- Insures against intentional and unintentional errors
  - All collaborators can verify results prior to publication
  - Enables more intensive peer review
Source: Berman et al. 2008.
Data Sharing Across Communities
- Data sharing practices vary greatly across communities
- Range of practices: Proprietary; Formal sharing; Formal deposit; Open Data
- Significant correlates: tacit knowledge, individual investment of time in data collection, confidentiality, journal practices, funder policies & practices
[Micah Altman, 10/6/2009]
Source: R.I.N. 2008; also see Borgman 2007; Niu 2006
So when do things go wrong? Source: Reich & Rosenthal 2005
Confidentiality Restrictions for Personal Private Information
- Overlapping laws differ in:
  - People/subjects covered
  - Organizations covered
  - Required technical and procedural controls
  - Definition of identifiability
- Some Strategies
  - Consent for sharing up front
  - Commercialize
  - Observe public activity
  - Share aggregates only
  - De-identify
- Recent Statistical Results (Oversimplified)
  - De-identification often leaks
  - Aggregation sometimes leaks
- Not included: EU directives, foreign laws, ANPRM
Request for Comment on proposed revisions to 45 CFR 46: www.hhs.gov/ohrp/humansubjects/anprm2011page.html
Integrating Tools
Data Management - Goals
Data Management Elements
Core Requirements for Data Sharing Infrastructure
- Stakeholder incentives: recognition; citation; payment; compliance; services
- Dissemination: access to metadata; documentation; data
- Access control: authentication; authorization; rights management
- Provenance: chain of control; verification of metadata, bits, semantic content
- Persistence: bits; semantic content; use
- Legal protection: rights management; consent; record keeping; auditing
- Usability: discovery; deposit; curation; administration; collaboration
- Business model
Sources: King 2007; ICSU 2004; NSB 2005
Why is Infrastructure for Data Sharing Necessary?
- Accessibility:
  - Many large data sets: in public archives
  - Most data in published articles: not accessible; results not replicable without the original author
  - Most data sets from federal grants: not publicly available
- Problems with discovery and linking, even with professional archives:
  - Data in different archives have different identifiers
  - Archives change identifiers, links
  - Changes to data are made; identifiers are reused or removed; old data are lost
  - Locating/browsing/extracting requires specialized tools & approaches
- Sharing data requires exposing tacit knowledge
  - Explicit documentation of data structure, collection process, interpretation
  - Harmonizing/linking to known ontologies, metadata schemas, vocabularies
- Data sets are not preserved like books
  - Static data files (even if on the web): unreadable after a few years
  - When storage methods change: some data sets are lost; others have altered content!
- Why not a single centralized infrastructure?
  - Single point of failure
  - Difficult when data are heterogeneous in format, origin, size, effort needed to collect or analyze, legal access rules, etc.
  - Data producers want credit, control, and visibility
Dataverse
- For Scholars
  - Brand it like your own website
  - Upload any type of data
  - Establish a persistent data citation
  - Facilitate data discovery
  - Provide live analysis
  - Receive permanent storage space
- For Organizations
  - Used by archives, libraries, journals, schools
  - Enable contributors to upload data
  - Organize studies by collections
  - Search across a universe of data
  - Control access and terms of use
  - Federate with catalogs and partners: OAI-PMH, LOCKSS, Z39.50, DDI
- Gateway to over 39000 social science studies (world’s largest catalog)
- Web Virtual Hosting 2.0 Service: over 350 virtual archives
- Federated search and delivery
Virtual Archive: Scholar Site
- Scholar retains control over branding and dissemination
- Preservation and long-term access are guaranteed
- Dissemination and compliance with Data Management Plans are verifiable
- Integrates with OpenScholar
Interoperability & Integration
Mind the Gaps
- GAP: Coverage across the entire lifecycle -- decoupling of dissemination, formal publication, long-term access, reuse
- GAP: Interoperability and integration across tools
- GAP: Maturity and sustainability of tools -- most tools have small communities of maintainers, which is particularly worrisome given the lack of interoperability
[Figure: lifecycle stages (design, publishing, dissemination, archiving, reuse, collection, processing, integration, analysis) mapped to tool classes: cati / capi; enhanced publication (sweave); identifiers; Google-__________; data archives, hosting, networks; general digital libraries and repositories; scientific workflow systems]
Supporting Institutions
Institutional Data Access Strategies*
- “Ignore it, maybe someone else will take care of it” (Internet Archive, …)
- “We’ll always be here” (self-preservation)
- Let the publishers do it
- “We are ever true to [Insert Alma Mater]” (institutional archives)
- “Ask us (domain archive) to do it” (ICPSR, MRA, Roper, …)
- “Ask someone(s) else to do it” (Data-PASS, MetaArchive, CLOCKSS)
- “Trust No One” (LOCKSS)
*All quotes are entirely fictional :-)
Institutional Preservation Strategies -- Corollaries
- There are potential single points of failure in technology, organization, and legal regimes:
  - Diversify your portfolio: multiple software systems, hardware, organization (e.g., Data-PASS :-)
  - Seek international partners
- Many combinations of preservation & dissemination strategies are compatible:
  - Layer technologies and strategies
  - Leverage dissemination (in a planned way) for preservation (and vice-versa)
- Preservation is impossible to demonstrate conclusively:
  - Consider organizational credentials
  - No organization is absolutely certain to be reliable
- Partnership Agreements
  - MOU
  - Succession Plans & Agreements
- Coordinating Operations
  - Development of shared procedures
- Joint “Not-bad” practices
  - Identification & selection
  - Metadata
  - Confidentiality
- Shared Catalog
  - Unified Discovery
  - Content replication
Data-PASS is a broad-based partnership of data archives dedicated to acquiring and preserving data at risk of being lost to the social science research community. Data-PASS partners have rescued thousands of data sets and created the largest catalog of social science data in existence. Data-PASS partners collaborate to identify and promote good archival practices, seek out at-risk research data, build preservation infrastructure, and mutually safeguard
Ideal integration of policy and technology?
Policy: a set of rules and objectives, expressed at a high domain level, that controls actions at a lower level
- Expressed in high-level domain/business language
- Captures a significant portion of the business domain
- Translated to a formal schematization
- Automatically measurable
- Directly controls procedures & actions to achieve compliance
- Verifiable translation from business domain policy
“The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies.”
[Figure labels: Policy; Schematization; Behavior (Operationalization)]
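The chain from policy to schematization to behavior on this slide can be illustrated with a small sketch. It is not SafeArchive's schema or API: the `ReplicationPolicy` class, the `audit` function, the archive names, and the `min_copies` threshold are all hypothetical (the quoted requirement only asks that copies and locations be identifiable; the minimum-copy rule is an added assumption to make the check measurable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical schematization: each object needs at least `min_copies` copies."""
    min_copies: int


def audit(inventory: dict[str, list[str]], policy: ReplicationPolicy) -> dict[str, dict]:
    """Report the number of copies, their locations, and compliance for each object."""
    report = {}
    for object_id, locations in inventory.items():
        distinct = sorted(set(locations))  # count each storage location once
        report[object_id] = {
            "copies": len(distinct),
            "locations": distinct,
            "compliant": len(distinct) >= policy.min_copies,
        }
    return report


# Hypothetical inventory: object identifier -> locations currently holding a copy.
inventory = {
    "study-001": ["archive-a", "archive-b", "archive-c"],
    "study-002": ["archive-a"],
}

report = audit(inventory, ReplicationPolicy(min_copies=3))
for object_id, row in report.items():
    print(object_id, row["copies"], row["locations"], row["compliant"])
```

The point of the sketch is only that a rule stated in domain language becomes automatically measurable once it is expressed as data (the policy object) and checked against observed state (the inventory).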
SafeArchive: TRAC-Based Management of LOCKSS
Facilitating collaborative replication and preservation with technology…
- Collaborators declare explicit non-uniform resource commitments
- Policy records commitments, storage network properties
- Storage layer provides replication, integrity, freshness, versioning
- SafeArchive software provides monitoring, auditing, and provisioning
- Content is harvested through HTTP (LOCKSS) or OAI-PMH
- Integration of LOCKSS, The Dataverse Network, TRAC
Aligning Incentives
Stakeholders & Information Flow
[Figure labels: Data Collection; Publication of Research Products]
Data Citation as a Leverage Point
- Services
  - Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
  - Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
  - Persistence of identifiers is needed to maintain long-term access
- Incentives
  - Scholarly credit (intellectual attribution) is a large motivator for many researchers: citation creates an incentive for researchers to publish data
  - Scholars also comply with enforceable journal policies: requiring data citation is a light-weight method to make data access policies auditable
  - Impact/usage is a motivator for public research funders: data citation provides a foundation for measures of usage and impact
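One way to read "identifiers to specific fixed versions" is that a citation carries a digest that a later reader can recompute. The sketch below uses a plain SHA-256 hash of the file bytes. This is a deliberate simplification and is not the format-independent fingerprint method cited under Related Work; the function names and sample data are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a byte-level digest of a data file (changes if any byte changes)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_citation(data: bytes, cited_fingerprint: str) -> bool:
    """Check whether retrieved data is the exact version that was cited."""
    return fingerprint(data) == cited_fingerprint


# Hypothetical data set as originally cited, and a later silently revised copy.
original = b"id,income\n1,52000\n2,48000\n"
revised = b"id,income\n1,52000\n2,49000\n"

cited = fingerprint(original)
print(matches_citation(original, cited))
print(matches_citation(revised, cited))
```

A byte-level hash also changes when a file is merely converted to another format, which is one reason a citation scheme might prefer a fingerprint computed over the data values rather than the stored bytes.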
Common Principles
Thanks to 37 Participants
What is a citation?
Workflow
Workflow
Design Principles
- Separate scientific principles, use cases, requirements
- Distinguish syntax and semantics from presentation
- Design for ecosystem & lifecycle
- Incremental value for incremental effort
- Think Globally, Act Locally
Theory
Theory +
- Data citations should be first-class objects for publication: they should appear with citations and be as easy to cite as other works
- At minimum, all data necessary to understand, assess, and extend the conclusions of a scholarly work should be cited
- Citations should persist and enable access to a fixed version of the data at least as long as the citing work
- Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem
Theory + Practice
Use Cases
Use Cases (details)
Operational Constraints?
- Syntax
- Interoperability
- Technical contexts of use
Actors
Simple Proposal
- Semantic: Persistent ID, Author, Title, Version (or at least date)
- Presentation
  - Any style
  - Grouped with other references
  - Actionable in context
- Policy
  - Treat data cites as first class
  - If it’s needed to support a claim, cite it
  - Offer credit to contributors
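The semantic elements in the proposal map naturally onto a small record type. The sketch below is one possible reading rather than a standard: the `DataCitation` class, its field names, the rendering order, and the sample values (including the placeholder identifier) are all hypothetical. It enforces only what the slide states: a persistent ID, author, title, and a version or at least a date.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    """Minimal semantic elements of a data citation; presentation is left to `render`."""
    persistent_id: str
    authors: tuple[str, ...]
    title: str
    version: Optional[str] = None
    date: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.persistent_id or not self.authors or not self.title:
            raise ValueError("persistent ID, author, and title are required")
        if not (self.version or self.date):
            raise ValueError("a version, or at least a date, is required")

    def render(self) -> str:
        """One arbitrary presentation style; the proposal allows any style."""
        parts = ["; ".join(self.authors)]
        if self.date:
            parts.append(self.date)
        parts.append(f'"{self.title}"')
        if self.version:
            parts.append(f"version {self.version}")
        parts.append(self.persistent_id)
        return ", ".join(parts)


# Placeholder values for illustration only; not a real data set or identifier.
citation = DataCitation(
    persistent_id="hdl:0000/EXAMPLE",
    authors=("Doe, J.",),
    title="Example Survey Data",
    version="2",
    date="2011",
)
print(citation.render())
```

Keeping the required fields separate from `render` mirrors the slide's split between semantics and presentation: a journal can restyle the output without touching what must be recorded.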
Discussion
- We cannot depend on a single tool
  - plans for integration and interoperability through citations and linking mechanisms, interchange formats, ontology hooks, protocols?
- A large portion of the benefit from data sharing arises from open access …
  - how can OpenShare “nudge” researchers toward Open Data?
- Individual researchers cannot ensure long-term access
  - how will OpenShapa fit in the institutional ecosystem?
Contact
Micah Altman
futurelib.org

More Related Content

PPT
Linking Data to Publications through Citation and Virtual Archives
PPTX
"Reproducibility from the Informatics Perspective"
PPTX
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
PPTX
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
PPTX
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
PPTX
State of the Art Informatics for Research Reproducibility, Reliability, and...
PPTX
Meeting Federal Research Requirements
PPTX
Managing Confidential Information – Trends and Approaches
Linking Data to Publications through Citation and Virtual Archives
"Reproducibility from the Informatics Perspective"
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
State of the Art Informatics for Research Reproducibility, Reliability, and...
Meeting Federal Research Requirements
Managing Confidential Information – Trends and Approaches

What's hot (20)

PPTX
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
PPTX
Data Management Planning for researchers
PPTX
Next generation data services at the Marriott Library
PDF
Levine - Data Curation; Ethics and Legal Considerations
PPTX
ICPSR Data Exploration Tools
PPTX
Instructional Data Sets from Q-step Launch Event (Univ of Exeter) 3-20-2014
PPTX
Reproducibility from an infomatics perspective
PDF
Why Data Citation Currently Misses the Point
PPTX
DataONE Education Module 10: Legal and Policy Issues
PPTX
Managing confidential data
PDF
Poster RDAP13: Research Data in eCommons @ Cornell: Present and Future
PPTX
From Data Sharing to Data Stewardship
PPTX
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
PDF
Who owns the data? Intellectual property considerations for academic research...
PPTX
Understanding ICPSR - An Orientation and Tours of ICPSR Data Services and Edu...
PDF
Poster: Very Open Data Project
PPTX
Niso library law
PDF
Data Policy for Open Science
PPTX
Shareable by Design: Making Better Use of your Research
PPTX
DCC and FAIR initiatives
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce...
Data Management Planning for researchers
Next generation data services at the Marriott Library
Levine - Data Curation; Ethics and Legal Considerations
ICPSR Data Exploration Tools
Instructional Data Sets from Q-step Launch Event (Univ of Exeter) 3-20-2014
Reproducibility from an infomatics perspective
Why Data Citation Currently Misses the Point
DataONE Education Module 10: Legal and Policy Issues
Managing confidential data
Poster RDAP13: Research Data in eCommons @ Cornell: Present and Future
From Data Sharing to Data Stewardship
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
Who owns the data? Intellectual property considerations for academic research...
Understanding ICPSR - An Orientation and Tours of ICPSR Data Services and Edu...
Poster: Very Open Data Project
Niso library law
Data Policy for Open Science
Shareable by Design: Making Better Use of your Research
DCC and FAIR initiatives
Ad

Similar to Data Sharing & Data Citation (20)

PPTX
DataONE Education Module 02: Data Sharing
PPTX
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
PDF
Open Data is Not Enough: Making Data Sharing Work
PPTX
A Big Picture in Research Data Management
PPTX
Data as a research output and a research asset: the case for Open Science/Sim...
PPTX
Introduction to research data management
PDF
You down with dmp yeah you know me!
PPTX
Findable, Accessible, Interoperable and Reusable (FAIR) data
PPTX
FAIR for the future: embracing all things data
PDF
Data Policy for Open Science
PPTX
NIH Data Summit - The NIH Data Commons
PPTX
Privacy in Research Data Managemnt - Use Cases
PDF
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
PDF
03 keynote dillo
PPT
Data management plans
PPTX
Open Science Globally: Some Developments/Dr Simon Hodson
PDF
Overview of standards/stakeholders in life science (RDA Engagement Interest G...
PPTX
Meeting the NSF DMP Requirement June 13, 2012
PPTX
Emerging Data Citation Infrastructure
PDF
big-data-and-data-sharing_ethical-issues.pdf
DataONE Education Module 02: Data Sharing
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
Open Data is Not Enough: Making Data Sharing Work
A Big Picture in Research Data Management
Data as a research output and a research asset: the case for Open Science/Sim...
Introduction to research data management
You down with dmp yeah you know me!
Findable, Accessible, Interoperable and Reusable (FAIR) data
FAIR for the future: embracing all things data
Data Policy for Open Science
NIH Data Summit - The NIH Data Commons
Privacy in Research Data Managemnt - Use Cases
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
03 keynote dillo
Data management plans
Open Science Globally: Some Developments/Dr Simon Hodson
Overview of standards/stakeholders in life science (RDA Engagement Interest G...
Meeting the NSF DMP Requirement June 13, 2012
Emerging Data Citation Infrastructure
big-data-and-data-sharing_ethical-issues.pdf
Ad

More from Micah Altman (20)

PPTX
Selecting efficient and reliable preservation strategies
PPTX
Well-Being - A Sunset Conversation
PPTX
Matching Uses and Protections for Government Data Releases: Presentation at t...
PPTX
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
PPTX
Well-being A Sunset Conversation
PPTX
Can We Fix Peer Review
PDF
Academy Owned Peer Review
PPTX
Redistricting in the US -- An Overview
PPTX
A Future for Electoral Districting
PPTX
A History of the Internet :Scott Bradner’s Program on Information Science Talk
PPTX
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
PPTX
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
PPTX
Utilizing VR and AR in the Library Space:
PPTX
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
PPTX
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
PDF
Ndsa 2016 opening plenary
PDF
Making Decisions in a World Awash in Data: We’re going to need a different bo...
PPTX
Software Repositories for Research-- An Environmental Scan
PDF
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
PPTX
Gary Price, MIT Program on Information Science
Selecting efficient and reliable preservation strategies
Well-Being - A Sunset Conversation
Matching Uses and Protections for Government Data Releases: Presentation at t...
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Well-being A Sunset Conversation
Can We Fix Peer Review
Academy Owned Peer Review
Redistricting in the US -- An Overview
A Future for Electoral Districting
A History of the Internet :Scott Bradner’s Program on Information Science Talk
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Utilizing VR and AR in the Library Space:
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
Ndsa 2016 opening plenary
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Software Repositories for Research-- An Environmental Scan
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
Gary Price, MIT Program on Information Science

Recently uploaded (20)

PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPT
Teaching material agriculture food technology
PDF
Encapsulation theory and applications.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Machine learning based COVID-19 study performance prediction
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
cuic standard and advanced reporting.pdf
PPTX
MYSQL Presentation for SQL database connectivity
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Spectral efficient network and resource selection model in 5G networks
Per capita expenditure prediction using model stacking based on satellite ima...
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Digital-Transformation-Roadmap-for-Companies.pptx
Review of recent advances in non-invasive hemoglobin estimation
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Big Data Technologies - Introduction.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Teaching material agriculture food technology
Encapsulation theory and applications.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
Machine learning based COVID-19 study performance prediction
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Understanding_Digital_Forensics_Presentation.pptx
cuic standard and advanced reporting.pdf
MYSQL Presentation for SQL database connectivity
The AUB Centre for AI in Media Proposal.docx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Spectral efficient network and resource selection model in 5G networks

Data Sharing & Data Citation

  • 1. Data Sharing & Data Citation Micah Altman, Institute for Quantitative Social Science, Harvard University Prepared for data coding, analysis, archiving, and sharing for open collaboration NSF Sept 15-16, 2011
  • 2. Collaborators* Margaret Adams, George Alter, Leonid Andreev, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Tom Carsey, Kevin Condon, Jonathan Crabtree, Merce Crosas, Darrell Donakowski, Myron Guttman, Gary King, Patrick King, Tom Lipkis, Freeman Lo, Jared Lyle, Marc Maynard, Nancy McGovern, Amy Pienta, Lois Timms-Ferrarra, Akio Sone, Bob Treacy, Copeland Young Research Support Thanks to the Library of Congress (PA#NDP03-1), the National Science Foundation (DMS-0835500, SES 0112072), IMLS (LG-05-09-0041-09), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive. Data Sharing & Data Citation * And co-conspirators
  • 3. Related Work Altman, M., and J. Crabtree, 2011. “Using the SafeArchive System: TRAC-Based Auditing of LOCKSS”, Proceedings of Archiving 2011. M. Crosas, 2011, “The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data”, D-Lib Magazine 17(1/2). M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist . 72(1): 169-182 Gutmann,M. Abrahamson, M, Adams, M.O., Altman, M, Arms, C., Bollen, K., Carlson, M., Crabtree, J., Donakowski, D., King, G., Lyle, J., Maynard, M., Pienta, A., Rockwell, R, Timms-Ferrara L., Young, C., 2009. "From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data", Library Trends 57(3):315-33 M. Altman, 2008, "A Fingerprint Method for Verification of Scientific Data" in, Advances in Systems, Computing Sciences and Software Engineering , (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007) , Springer Verlag. M. Altman and G. King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4 (March/April). G. King, 2007, " An Introduction to the Dataverse Network as an Infrastructure for Data Sharing", Sociological Methods and Research , Vol. 32, No. 2, pp. 173-199 Data Sharing & Data Citation
  • 4. Motivations Data Sharing & Data Citation
  • 5. Access to Data is the Foundation of Science Science is not (only) about being scientific Scientific progress requires community: competition and collaboration in the pursuit of common goals Without access to the same materials: no community exists … data is the nucleus of scientific collaboration The value of an article that can’t be replicated: ? Scholarly articles are summaries, not the actual research results Experimental expensive to reproduce, observational data impossible Hard for journal editors to verify -- If you find it, how do you know it’s the same? Replication projects show: many published articles cannot be replicated … data is needed for scientific replication Data Sharing & Data Citation Sources: Fienberg et. al 1985; ICSU 2004; Nature 2009
  • 6. Open Data Broadens & Deepens Impact Data Intensive Science Increased opportunities for interdisciplinarity Science modeling across multiple scales Continuous, complete, fine-grained information on physical processes, systems, human behavior Education Data eases transition from education to research Open Data Democratizes Science Citizen-scientist Developing countries Researchers outside of inner-circle of institution Crowd-sourcing, open notebooks, and mashups Data Sharing & Data Citation & Data Sharing Increases Publication Impact [Gleditsch 2003; Wilson 2008; Piowar 2007]
  • 7. Data is Key to Government Statistics = state-istics Reformers use data To assess the performance of the state To assess social conditions Governments attempt to control access to data to evade accountability Policy debates often centers on data War on poverty, civil rights, consumer protection – all made heavy use of statistical arguments Economic, environment policies are data-intensive Data access brings together both sides of political spectrum In modern democracy the public needs a direct source of information Liberals and conservatives support access to data informing policy Data Sharing & Data Citation Source: “Propaganda” http://www.media-studies.ca/articles/images/berlin_wall.jpg Sources: Gough 2003; Shulman 2006; Wagner & Steinzor 2006; Alonzo and Starr 1988
  • 8. Open Data is “Research Insurance” Keeps open option to after nominal end of project – extends lifecycle Continuation projects Publication revisions Broader research programs Insures against loss of “project memory” Departure of a senior personnel from institution Departure of post-docs, graduate students from students Accidental loss of data due to local IT failures Reduces questions from secondary analysts Insures against intentional and unintentional errors All collaborators can verify results prior to publication Enables more intensive peer review Data Sharing & Data Citation Source: Berman, et. al 2008.
  • 9. Data Sharing Across Communities Data sharing practices vary greatly across communities Proprietary Formal sharing Formal deposit Significant correlates: Tacit knowledge, Individual investment of time in data collection, confidentiality, journal practices, funder policies & practices [Micah Altman, 10/6/2009] Open Data Source: R.I.N. 2008 also see Borgman 2007; Niu 2006
  • 10. So when do things go wrong? Source: Reich & Rosenthal 2005
  • 11. Confidentiality Restrictions for Personal Private Information Overlapping laws differ: People/subjects covered Organizations covered Required technical and procedural controls Definition of identifiability Some Strategies Consent for sharing up front Commercialize Observe public activity Share aggregates only De-identify Recent Statistical Results (Oversimplified  ) De-identification often leaks Aggregation sometimes leaks Not included : EU directives, foreign laws, ANPRM Request for Comment on proposed revisions to 45 CFR 46 www.hhs.gov/ohrp/humansubjects/anprm2011page.html
  • 12. Integrating Tools
  • 13. Data Management - Goals
  • 14. Data Management Elements
  • 15. Core Requirements for Data Sharing Infrastructure. Stakeholder incentives: recognition; citation; payment; compliance; services. Dissemination: access to metadata; documentation; data. Access control: authentication; authorization; rights management. Provenance: chain of control; verification of metadata, bits, semantic content. Persistence: bits; semantic content; use. Legal protection: rights management; consent; record keeping; auditing. Usability: discovery; deposit; curation; administration; collaboration. Business model. Sources: King 2007; ICSU 2004; NSB 2005.
  • 16. Why is Infrastructure for Data Sharing Necessary? Accessibility: many large data sets are in public archives, but most data in published articles are not accessible (results are not replicable without the original author), and most data sets from federal grants are not publicly available. Problems with discovery and linking, even with professional archives: data in different archives have different identifiers; archives change identifiers and links; changes to data are made, identifiers are reused or removed, and old data are lost; locating, browsing, and extracting require specialized tools and approaches. Sharing data requires exposing tacit knowledge: explicit documentation of data structure, collection process, and interpretation; harmonizing and linking to known ontologies, metadata schemas, and vocabularies. Data sets are not preserved like books: static data files (even if on the web) become unreadable after a few years; when storage methods change, some data sets are lost and others have altered content! Why not a single centralized infrastructure? It is a single point of failure; it is difficult when data are heterogeneous in format, origin, size, effort needed to collect or analyze, legal access rules, etc.; and data producers want credit, control, and visibility.
  • 17. Dataverse. For Scholars: brand it like your own website; upload any type of data; establish a persistent data citation; facilitate data discovery; provide live analysis; receive permanent storage space. For Organizations: used by archives, libraries, journals, schools; enable contributors to upload data; organize studies by collections; search across a universe of data; control access and terms of use; federate with catalogs and partners (OAI-PMH, LOCKSS, Z39.50, DDI). Gateway to over 39000 social science studies (world’s largest catalog). Web Virtual Hosting 2.0 Service: over 350 virtual archives. Federated search and delivery.
  • 18. Virtual Archive: Scholar Site. Scholar retains control over branding and dissemination. Preservation and long-term access are guaranteed. Dissemination and compliance with Data Management Plans are verifiable. Integrates with OpenScholar.
  • 20. Mind the Gaps. GAP: coverage across the entire lifecycle (decoupling of dissemination, formal publication, long-term access, reuse). GAP: interoperability and integration across tools. GAP: maturity and sustainability of tools (most tools have small communities of maintainers, which is particularly worrisome given the lack of interoperability). [Figure labels. Lifecycle stages: design; publishing; dissemination; archiving; reuse; collection; processing; integration; analysis. Tool categories: CATI/CAPI; enhanced publication (Sweave); identifiers; Google-__________; data archives, hosting, networks; general digital libraries and repositories; scientific workflow systems.]
  • 21. Supporting Institutions
  • 22. Institutional Data Access Strategies*: “Ignore it, maybe someone else will take care of it” (Internet Archive, …); “We’ll always be here” (self-preservation); “Let the publishers do it”; “We are ever true to [Insert Alma Mater]” (institutional archives); “Ask us (domain archive) to do it” (ICPSR, MRA, Roper, …); “Ask someone(s) else to do it” (Data-PASS, MetaArchive, CLOCKSS); “Trust No One” (LOCKSS). *All quotes are entirely fictional :-)
  • 23. Institutional Preservation Strategies: Corollaries. There are potential single points of failure in technology, organization, and legal regimes: diversify your portfolio with multiple software systems, hardware, and organizations (e.g., Data-PASS :-); seek international partners. Many combinations of preservation & dissemination strategies are compatible: layer technologies and strategies; leverage dissemination (in a planned way) for preservation (and vice versa). Preservation is impossible to demonstrate conclusively: consider organizational credentials; no organization is absolutely certain to be reliable.
  • 24. Partnership agreements: MOU; succession plans & agreements. Coordinating operations: development of shared procedures. Joint “not-bad” practices: identification & selection; metadata; confidentiality. Shared catalog: unified discovery. Content replication. Data-PASS is a broad-based partnership of data archives dedicated to acquiring and preserving data at risk of being lost to the social science research community. Data-PASS partners have rescued thousands of data sets and created the largest catalog of social science data in existence. Data-PASS partners collaborate to identify and promote good archival practices, seek out at-risk research data, build preservation infrastructure, and mutually safeguard …
  • 25. Ideal integration of policy and technology? Policy: a set of rules and objectives, expressed at a high domain level, that controls actions at a lower level. Ideal properties: expressed in high-level domain/business language; captures a significant portion of the business domain; translated to a formal schematization; automatically measurable; directly controls procedures & actions to achieve compliance; verifiably translated from business-domain policy.
  • 26. Policy: “The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies.” Policy → Schematization → Behavior (Operationalization)
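The chain on slide 26, from a high-level policy statement to a schematization to measurable behavior, can be sketched roughly as follows. This is a minimal illustration, not the SafeArchive implementation: the `ReplicationPolicy` class, the `min_copies` threshold, the `audit` function, and the node and study names are all hypothetical. The quoted requirement only asks that copies and their locations be identifiable; the minimum-copies check is an added assumption to show how a schematized rule becomes automatically measurable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Schematized policy (hypothetical): required number of distinct copies."""
    min_copies: int


def audit(inventory, policy):
    """For each object, report copy count, copy locations, and compliance."""
    report = {}
    for object_id, locations in inventory.items():
        distinct = sorted(set(locations))  # ignore duplicate location entries
        report[object_id] = {
            "copies": len(distinct),
            "locations": distinct,
            "compliant": len(distinct) >= policy.min_copies,
        }
    return report


# Hypothetical storage-network inventory: object ID -> replica locations.
inventory = {
    "study-001": ["node-a", "node-b", "node-c"],
    "study-002": ["node-a"],
}

report = audit(inventory, ReplicationPolicy(min_copies=3))
noncompliant = [oid for oid, r in report.items() if not r["compliant"]]
print(noncompliant)
```

Keeping the policy as data (rather than hard-coding the threshold in the audit) mirrors the slide's separation between the policy, its schematization, and the behavior that enforces it.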
  • 27. SafeArchive: TRAC-Based Management of LOCKSS. Facilitating collaborative replication and preservation with technology: collaborators declare explicit non-uniform resource commitments; policy records commitments and storage network properties; the storage layer provides replication, integrity, freshness, and versioning; SafeArchive software provides monitoring, auditing, and provisioning; content is harvested through HTTP (LOCKSS) or OAI-PMH; integration of LOCKSS, the Dataverse Network, and TRAC.
  • 28. Aligning Incentives
  • 29. Stakeholders & Information Flow [figure labels: Data Collection; Publication of Research Products]
  • 30. Data Citation as a Leverage Point. Services: identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance; identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services; persistence of identifiers is needed to maintain long-term access. Incentives: scholarly credit (intellectual attribution) is a large motivator for many researchers, so citation creates an incentive for researchers to publish data; scholars also comply with enforceable journal policies, and requiring data citation is a lightweight method to make data access policies auditable; impact/usage is a motivator for public research funders, and data citation provides a foundation for measures of usage and impact.
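Slide 30 asks for identifiers that point to a specific fixed version of the data. One common way to check that the bits behind an identifier have not changed is to store a content fingerprint alongside the citation. The sketch below uses a plain SHA-256 hash over a canonical JSON serialization; this is a stand-in chosen for illustration, not the fingerprint method cited in the related work, and the `fingerprint` function and sample rows are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 hex digest over a canonical JSON serialization of the rows."""
    # sort_keys makes the digest independent of key order within each record.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical data set versions.
v1 = [{"id": 1, "x": 2.5}, {"id": 2, "x": 3.0}]
v1_reordered_keys = [{"x": 2.5, "id": 1}, {"x": 3.0, "id": 2}]
v2 = [{"id": 1, "x": 2.5}, {"id": 2, "x": 3.1}]

# Same content, different key order: same fingerprint.
assert fingerprint(v1) == fingerprint(v1_reordered_keys)
# One changed value: different fingerprint, so the citation no longer matches.
assert fingerprint(v1) != fingerprint(v2)
```

A byte-level hash like this is sensitive to row order and numeric formatting, so a production scheme would need a more careful canonical form; the sketch only shows the verification idea.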
  • 31. Common Principles
  • 33. Thanks to 37 Participants
  • 34. What is a citation?
  • 36. Workflow
  • 37. Workflow
  • 38. Design Principles: separate scientific principles, use cases, and requirements; distinguish syntax and semantics from presentation; design for ecosystem & lifecycle; incremental value for incremental effort; think globally, act locally.
  • 39. Theory
  • 40. Theory +. Data citations should be first-class objects for publication: they should appear with citations and be as easy to cite as other works. At minimum, all data necessary to understand, assess, and extend conclusions in scholarly work should be cited. Citations should persist and enable access to a fixed version of the data at least as long as the citing work. Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem.
  • 41. Theory + Practice
  • 42. Use Cases
  • 43. Use Cases (details). Operational constraints? Syntax; interoperability; technical contexts of use.
  • 44. Actors
  • 45. Simple Proposal. Semantic: persistent ID, author, title, version (or at least date). Presentation: any style; grouped with other references; actionable in context. Policy: treat data cites as first class; if it's needed to support a claim, cite it; offer credit to contributors.
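The semantic elements listed on slide 45 (persistent ID, author, title, and a version or at least a date) can be modeled as a small record that is kept separate from how it is rendered, in line with the "any style" presentation rule. The following is a sketch under stated assumptions: the `DataCitation` class, its field names, the rendering format, and the placeholder identifier are all hypothetical and are not a format proposed in the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataCitation:
    """Minimal semantic elements of a data citation (hypothetical model)."""
    persistent_id: str
    authors: Tuple[str, ...]
    title: str
    version: Optional[str] = None
    date: Optional[str] = None

    def __post_init__(self):
        # The proposal asks for a version, or at least a date.
        if self.version is None and self.date is None:
            raise ValueError("a data citation needs a version or at least a date")

    def render(self):
        """One possible presentation; the semantics do not depend on it."""
        fixed = f"version {self.version}" if self.version else self.date
        return f"{'; '.join(self.authors)}. {self.title}. {fixed}. {self.persistent_id}"


# Placeholder values only; this is not a real data set or identifier.
cite = DataCitation(
    persistent_id="example-id:0000/placeholder",
    authors=("Jane Doe", "Richard Roe"),
    title="Example Survey Data",
    version="2",
)
print(cite.render())
```

Because the rendering lives in one method, a journal could swap in its own reference style without touching the stored elements.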
  • 46. Discussion. We cannot depend on a single tool: what are the plans for integration and interoperability through citations and linking mechanisms, interchange formats, ontology hooks, and protocols? A large portion of the benefit from data sharing arises from open access: how can OpenShare “nudge” researchers toward Open Data? Individual researchers cannot ensure long-term access: how will OpenShapa fit in the institutional ecosystem?
  • 47. Contact: Micah Altman, futurelib.org

Editor's Notes

  • #2: This work by Micah Altman (http://redistricting.info) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.