Transparency and
reproducibility in research
Louise Corti and Diarmuid McDonnell
UK Data Service
UK Data Archive, University of Essex
Copyright © [year] UK Data Service. Created by [Organisation], [Institution]
ESS Summer school: An introduction to using big data in
the social sciences
20-24 July 2020
University of Essex, Colchester
Issues
• The reproducibility agenda
• What does transparency mean for big data?
• Where to publish data and code
• Reproducing a research output
Science is Under Attack
• Big challenges to science
• Fraud and mistakes
• Publication bias
• Political pressures
• Raw data availability
Protecting the Legitimacy of Research
Transparency is now becoming an obligation
• Professional ethics guidance
• Funders require public access to data
• Journal guidelines
• Transparency in methods
• Data access statements
• Pre-analysis plans
• Robust data citation
Transparency Organizations
Journal / Publisher Data Policies
• Most science journals require a Data Availability
Statement
• And some have formalised data replication policies
• E.g. Science, Nature, PLOS ONE
“PLOS ONE will not consider a study if the conclusions depend solely on
the analysis of proprietary data … the paper must include an analysis of
public data that validates the conclusions so others can reproduce the analysis.”
• e.g. BioMed Central open data statement
• Digital Object Identifiers (DOIs) usually required
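A DOI is cited as a persistent link by prefixing it with the doi.org resolver. The sketch below shows one way to normalise a DOI into that form before putting it in a data availability statement. The function name and the DOI in the usage note are made up for illustration.

```python
def doi_to_url(doi):
    """Turn a DOI written in a common style into an https://doi.org/ link."""
    cleaned = doi.strip()
    # Drop a resolver or "doi:" prefix if the DOI was pasted with one.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # All DOIs start with the directory indicator "10.".
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned
```

For example, calling `doi_to_url("doi:10.1234/example")` with this invented DOI gives `https://doi.org/10.1234/example`.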
Social science: journal transparency
• Replication policies exist for psychology, economics and
political science journals
  • The Journal of Development Economics
  • Quarterly Journal of Economics
  • Quarterly Journal of Political Science
• Options:
• Some require a preregistered hypothesis/design
• Authors supply enough information that the exact analysis can be
replicated - raw data (anonymized if necessary), data collection
protocols, computer programs and scripts
• Editors required to replicate the study
• Journal may have own repository, e.g. Dataverse
https://dataverse.harvard.edu/dataverse/ajps
The American Economic Review
It is the policy of the American Economic Review to publish papers only if
the data used in the analysis are clearly and precisely documented and are
readily available to any researcher for purposes of replication. Authors of
accepted papers that contain empirical work, simulations, or experimental
work must provide to the Review, prior to publication, the data, programs,
and other details of the computations sufficient to permit replication. These
will be posted on the AER Web site. The Editor should be notified at the
time of submission if the data used in a paper are proprietary or if, for some
other reason, the requirements above cannot be met.
As soon as possible after acceptance, authors are expected to send their
data, programs, and sufficient details to permit replication, in electronic
form, to the AER office.
Data Availability Policy
The case of political science journals
• Data Access and Research Transparency (DA-RT):
Joint Statement by Political Science Journal
Editors
“Journal editors commit their respective journals to the principles of
data access and research transparency, and to implementing
policies requiring authors to make as accessible as possible the
empirical foundation and logic of inquiry of evidence-based
research”
http://media.wix.com/ugd/fa8393_da017d3fed824cf587932534c860ea25.pdf
• 26 journals signed up by 2016
• Some established rules for implementation
DA-RT initiative
1. Data Transparency - whereby researchers publicize
the data they use as evidence. Includes primary data
collected by researchers and secondary sources they
may use in their work
2. Analytic Transparency - where researchers publicize
how they measure, code, interpret, and analyze that
data
3. Process (or Production) Transparency - where
researchers explain how they came to choose their
research design, and why they used particular sets of
data, theories, and methods
Good research relies on visibility of process
• Data Sharing: data made available for secondary analysis
• Research Transparency: data used to support an evidence-based claim
Replication Data Policy
Guidelines for Accepted Articles:
The manuscript will not be published unless the first
footnote explicitly states where the data used in the study
can be obtained for purposes of replication and any
sources that funded the research. All replication files must
be stored on the AJPS Data Archive on Dataverse.
How to prepare a replication file:
https://ajpsblogging.files.wordpress.com/2015/03/ajps-guide-for-replic-materials-1-0.pdf
Requirements for replication files
• Readme File - names of all files in the study plus
description
• Analysis Dataset(s) - each Study must include one or
more files containing the data required to reproduce all
tables, figures, and other analytic results reported in
the article
• Code book
• Information and software commands to reconstruct an
analysis dataset from another source
• Syntax for new derived variables
• Access instructions
• Code for the analysis
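A checklist like this can be checked automatically before a deposit. The sketch below is one possible approach, not the AJPS procedure: it scans a folder and reports which components appear to be missing. The file-name patterns are assumptions chosen for illustration, and a real archive may expect different names.

```python
import fnmatch
from pathlib import Path

# Hypothetical file-name patterns for each required component.
REQUIRED_COMPONENTS = {
    "readme": ("readme*",),
    "analysis dataset": ("*.csv", "*.dta", "*.tab"),
    "codebook": ("*codebook*",),
    "analysis code": ("*.do", "*.r", "*.py"),
}


def missing_components(package_dir):
    """Return the names of required components with no matching file."""
    # Lower-case the names so that README.txt and analysis.R also match.
    names = [p.name.lower() for p in Path(package_dir).rglob("*") if p.is_file()]
    missing = []
    for component, patterns in REQUIRED_COMPONENTS.items():
        found = any(
            fnmatch.fnmatchcase(name, pattern)
            for name in names
            for pattern in patterns
        )
        if not found:
            missing.append(component)
    return missing
```

Calling `missing_components("my_replication_folder")` returns an empty list when every component has at least one matching file. Matching on names alone is crude (a file called codebook.csv would also count as a dataset), so the result is a prompt for a manual check rather than a guarantee.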
Data intensive research?
• For analyses based on highly data-intensive procedures, it is
not necessary to provide the full contents of each replicated
dataset; instead, provide the full set of relevant results and
software command files
  • Stata .do files, R command scripts, text files
  • Comment statements should be used extensively
  • Links to data sources used (even if being continually updated)
  • Software version used, e.g. R version 3.1.0
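Recording the software version can be scripted rather than typed by hand. The slide's example is R; the sketch below shows the same idea in Python using only the standard library. The function name is an assumption, and the list of packages to report is whatever the analysis actually uses.

```python
import platform
from importlib import metadata


def environment_report(packages=()):
    """Return lines describing the interpreter, OS, and package versions."""
    lines = [
        f"Python {platform.python_version()}",
        f"Platform {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report the gap instead of failing, so the file is still written.
            version = "not installed"
        lines.append(f"{name} {version}")
    return lines
```

Writing `"\n".join(environment_report(["pandas"]))` to a text file next to the command scripts gives readers the versions needed to rerun them. Here `pandas` is only an example of a package name.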
Example: climate models
• Models based on mathematical representations of the
climate system
• Expressed as code or syntax, run on powerful computers
Confidence in models:
• Models that are based on established physical laws
• mass, energy and momentum
• wealth of natural environmental observations
• The ability of models to simulate important aspects of the
current climate
• Evaluation undertaken by organised multi-model comparisons;
weather forecasts
How Reliable Are the Models Used to Make
Projections of Future Climate Change?
Questions to ask when attempting replication
• Where might you get the data from? URL or a DOI?
• Are there any access issues with reusing the data?
• How many separate data files might you need?
• Can you get raw or derived/processed data?
• Can you link the data types easily?
• Which variables are needed? Can these be regrouped or
recoded?
• Which variables were omitted from the models? Do you
need them in the dataset?
• What about weights for data – item non-response?
• Is the syntax/code clear?
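Two of these questions, which variables are needed and whether weights matter, can be tested directly on a downloaded file. The sketch below assumes a CSV file with a header row. The sample data, variable names and function names are all made up for illustration.

```python
import csv
import io

# Made-up extract standing in for a downloaded replication file.
SAMPLE = """id,income,weight
1,10,1
2,20,1
3,30,2
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by variable name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def missing_variables(rows, required):
    """Return the required variable names that the file does not contain."""
    present = set(rows[0].keys()) if rows else set()
    return [name for name in required if name not in present]


def weighted_mean(rows, value, weight):
    """Weighted mean: sum of weight times value, divided by sum of weights."""
    total_weight = sum(float(r[weight]) for r in rows)
    return sum(float(r[value]) * float(r[weight]) for r in rows) / total_weight
```

With the sample above, `missing_variables(load_rows(SAMPLE), ["income", "region"])` reports that `region` is absent. The weighted mean of `income` is 22.5, against an unweighted mean of 20, which shows why a replication needs to know whether and how the original analysis applied weights.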
Data, replication data and evidence
• Research data underpinning a paper are still often not available,
even on request. Sociology is lagging behind
• Formally publishing a replication dataset provides an
opportunity to be reproducible!
• Publishing the whole dataset from a research study brings
added advantages for the creator: longevity, robust citation
and proper attribution
Data papers
Questions and exercise
