Overcoming the Reproducibility Crisis:
and why I stopped worrying and learned to love open data (& methods)
Scott Edmunds
1st July 2013
Harnessing Data-Driven Intelligence
Using networking power of the internet to tackle problems
Enables:
Can ask new questions & find hidden patterns & connections
Build on each other's efforts quicker & more efficiently
More collaborations across more disciplines
Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding
Enabled by:
Removing silos, open licenses, transparency, immediacy
Dead trees not fit for purpose
1812, 1665, 1869
The problems with publishing
• Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab
and reproducible research, 1995
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication.
• If there is interest in data, only to monetise & re-silo
• Traditional publishing policies and practices a hindrance
The consequences: growing replication gap
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results
from 10 could not be reproduced
Consequences: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with
higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index
http://iai.asm.org/content/79/10/3855.abstract
Consequences: growing replication gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
More retractions:
>15X increase in last decade
At the current rate of increase, by 2045 as many papers will be retracted as published
Insufficient methods
“Faked research is endemic in China”
Global perceptions of Asian Research
Million RMB rewards for high IF publications = ?
New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html
Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html
Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full
Nature, 20th July 2011: http://www.nature.com/news/2011/110720/full/475267a.html (Nature 475, 267 (2011))
“Wide distribution of information is key to scientific progress, yet
traditionally, Chinese scientists have not systematically released
data or research findings, even after publication.”
“There have been widespread complaints from scientists inside
and outside China about this lack of transparency.”
“Usually incomplete and unsystematic, [what little supporting
data released] are of little value to researchers and there is
evidence that this drives down a paper's citation numbers.”
STAP paper demonstrates problems:
Need:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies
New incentives/credit
• Data
• Software
• Review
• Re-use…
= Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public
repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data
set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
GigaSolution: deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
Utilizes big-data infrastructure and expertise from:
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
Rewarding open data
Submission Workflow
1. Submitter logs in to GigaDB website and uploads Excel submission file.
2. Validation checks:
   Fail – submitter is provided error report
   Pass – dataset is uploaded to GigaDB.
3. Curator Review: curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues). GigaDB DOI assigned.
4. Submitter provides files by ftp or Aspera.
5. DataCite XML file is generated and registered with DataCite.
6. Curator makes dataset public (can be set as future date if required).
Public GigaDB dataset: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
See: http://database.oxfordjournals.org/content/2014/bau018.abstract
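The submission workflow can be read as a small state machine: validate, curate, assign a DOI, then publish. The Python sketch below is only an illustration of that flow and is not GigaDB's actual code. The required field names, the status labels and the function names are all assumptions made for the example; only the DOI prefix 10.5524 comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical required fields: the real Excel template is not described in the slides.
REQUIRED_FIELDS = ("title", "description", "submitter_email")
DOI_PREFIX = "10.5524"  # prefix seen in the example DOI 10.5524/100003


@dataclass
class Submission:
    metadata: dict
    status: str = "submitted"
    errors: list = field(default_factory=list)
    doi: Optional[str] = None


def validate(sub):
    """Validation checks: fail gives an error report, pass moves on to curation."""
    sub.errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not str(sub.metadata.get(name, "")).strip()
    ]
    sub.status = "failed_validation" if sub.errors else "awaiting_curation"
    return sub


def assign_doi(sub, dataset_number):
    """Curator step: only a validated submission may receive a DOI."""
    if sub.status != "awaiting_curation":
        raise ValueError("only validated submissions can receive a DOI")
    sub.doi = f"{DOI_PREFIX}/{dataset_number}"
    sub.status = "doi_assigned"
    return sub


def publish(sub, files_received):
    """Curator makes the dataset public once the files have arrived."""
    if sub.status != "doi_assigned" or not files_received:
        raise ValueError("need an assigned DOI and transferred files before release")
    sub.status = "public"
    return sub


good = validate(Submission({"title": "Macaque genome", "description": "Genomic data",
                            "submitter_email": "someone@example.org"}))
publish(assign_doi(good, 100003), files_received=True)
print(good.status, good.doi)

bad = validate(Submission({"title": "No description"}))
print(bad.status, bad.errors)
```

Registering the DataCite XML and the actual file transfer are deliberately left out, since the slides do not say how they are implemented.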
• Aspera = 10-100x faster up/download than FTP
• Multi-omics/large scale biomedical data focus
• Provide (ISA) curation & integration with other DBs
(e.g. INSDC, MetaboLights, PRIDE, etc.)
For more see: http://database.oxfordjournals.org/content/2014/bau018.abstract
IRRI GALAXY
Rice 3K project: 3,000 rice genomes, 13.4TB public data
Beneficiaries/users of our work
Diverse Data Types
Cyber-centipedes & virtual worms
More transparency:
open peer review
BMC Series
Medical Journals
More transparency (and speed):
pre-prints
Real-time open-review = paper in arXiv + blogged reviews
Reward open & transparent review
http://tmblr.co/ZzXdssfOMJfy
www.gigasciencejournal.com/content/2/1/10
GigaScience + Publons = further credit for reviewers' efforts
Readers are interested in open review
Next step to link to ORCID
Cloud solutions?
Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.
Implement workflows in a community-accepted format
http://galaxyproject.org
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source
Rewarding and aiding reproducibility
galaxy.cbiit.cuhk.edu.hk
Visualizations
& DOIs for workflows
Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods)
How are we supporting data reproducibility?
Data sets and analyses are each linked to a DOI
Open-Paper: DOI:10.1186/2047-217X-1-18, >26,000 accesses
Open-Review: 7 reviewers tested data in ftp server & named reports published
Open-Data: 78GB CC0 data
Open-Code: code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/ (>20,000 downloads)
Enabled code to be picked apart by bloggers in wiki:
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Open-Pipelines, Open-Workflows
GigaDB DOIs: DOI:10.5524/100044, DOI:10.5524/100038
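Because each component carries a DOI, a reader can get from the paper to the data and code through the public resolver at doi.org. A minimal Python sketch, using the three DOIs quoted on this slide; the helper name `doi_to_url` is made up for the example, and the sketch makes no claim about which DOI holds which component.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn 'DOI:10.5524/100044' or '10.5524/100044' into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Keep the slash between prefix and suffix; percent-encode anything unusual.
    return DOI_RESOLVER + quote(doi, safe="/")


for doi in ("DOI:10.1186/2047-217X-1-18", "DOI:10.5524/100044", "DOI:10.5524/100038"):
    print(doi_to_url(doi))
```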
7 referees downloaded & tested data, then signed reports
Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, including:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Also will be available to download by >36K Galaxy users in
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
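One simple way to answer "can we reproduce results?" is to compare checksums of the published output files with those of a re-run. The sketch below is an assumption about how such a check might look, not the method used in the case study. The function names are hypothetical, and byte-identical output is a strict criterion that a multi-threaded assembler may not meet even when the result is scientifically equivalent.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_output(published_path, rerun_path):
    """True when the re-run file is byte-for-byte identical to the published one."""
    return sha256_of(published_path) == sha256_of(rerun_path)
```

Where exact bytes differ, comparing summary statistics such as N50 within a tolerance is a softer alternative.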
Taking a microscope to peer review
The SOAPdenovo2 Case study
Subject to and test with 3 models:
Types of resources in an RO: Data, Method/Experimental protocol, Findings
Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL
1. While there are huge improvements to the quality of the
resulting assemblies, other than the tables it was not stressed in
the text that the speed of SOAPdenovo2 can be slightly slower
than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the
correct results in table 2, where we say the scaffold N50 metric
is an order of magnitude longer from SOAPdenovo2 versus
SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the
correct results in table 2, where we say SOAPdenovo2
produced a contig N50 1.53 times longer than ALL-PATHS, this
should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly
length produced by SOAPdenovo2 was 3-80 fold longer than
SOAPdenovo1, this should be 3-64 fold longer.
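Errors like these are mismatches between a number stated in the text and the ratio recomputed from the table, so they can be checked mechanically. The Python sketch below assumes a 5% tolerance and uses made-up function names. The stated and recomputed pairs are the ones listed in the corrections above, while the N50 values passed to `fold_change` are hypothetical and are not the paper's data.

```python
def fold_change(new, old):
    """Ratio of a metric between two tools, e.g. N50(new) / N50(old)."""
    return new / old


def claim_matches(stated, recomputed, rel_tol=0.05):
    """True when the figure quoted in the text is within rel_tol of the recomputed one."""
    return abs(stated - recomputed) <= rel_tol * abs(recomputed)


# (figure stated in the text, figure recomputed from the table)
corrections = {
    "contig N50 vs ALL-PATHS": (1.53, 2.18),
    "assembly length vs SOAPdenovo1, upper bound": (80, 64),
}

for label, (stated, recomputed) in corrections.items():
    verdict = "ok" if claim_matches(stated, recomputed) else "MISMATCH"
    print(f"{label}: text says {stated}x, table gives {recomputed}x -> {verdict}")

# Hypothetical N50 values, chosen only to show the calculation.
print(fold_change(45_000, 1_000))
```

Running a check like this before submission is cheaper than issuing a correction afterwards.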
Case Study: Lessons Learned
• Most published research findings are false. Or at least have
errors.
• It is possible to push button(s) & recreate a result from a paper
• On a semantic level (via nanopublications) can still have
minor errors in text (interpretation not data)
• Reproducibility is COSTLY. How much are you willing to
spend?
• Much easier to do this before rather than after publication
Aiding reproducibility of imaging studies
OMERO: providing
access to imaging data
View, filter, measure raw
images with direct links
from journal article.
See all image data, not
just cherry picked
examples.
Download and reprocess.
OMERO: Aiding reproducibility, adding value
The alternative...
...look but don't touch
Step 1: get your code out there
• Put everything in a code repository. Even if it is ugly (see CRAPL)
• Version control. Make sure you document exact version in the
paper (big problem with lots of our papers).
• If system environments are important, consider VMs
http://matt.might.net/articles/crapl/
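Documenting the exact version is easier when the analysis script records it itself. The Python sketch below is one possible way to do that, assuming the code lives in a git repository. The function names and the choice of fields are illustrative, and the commit is reported as `None` when git or a repository is not available.

```python
import json
import platform
import subprocess


def git_commit(repo_dir="."):
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def provenance(repo_dir="."):
    """Collect the details a methods section should quote."""
    return {
        "git_commit": git_commit(repo_dir),
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


print(json.dumps(provenance(), indent=2))
```

Saving this record next to every result file means the version quoted in the paper can be copied rather than remembered.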
Beyond Commenting Code:
Step 2: Open lab books, dynamic documents
• Facilitate reuse and sharing with tools like: Knitr, Sweave, IPython Notebook
• Working towards executable papers…
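The idea behind dynamic documents is that numbers in the text are computed from the data when the document is built, so text and tables cannot drift apart. Knitr and Sweave do this for R; the sketch below shows the same principle in plain Python with `string.Template`. The tool names and N50 values are hypothetical and stand in for real results.

```python
from string import Template


def render(template_text, results):
    """Fill a sentence template with freshly computed values (two decimal places)."""
    values = {name: f"{value:.2f}" for name, value in results.items()}
    return Template(template_text).substitute(values)


# Hypothetical contig N50 values for two assemblers.
n50 = {"tool_a": 1200.0, "tool_b": 800.0}
results = {"ratio": n50["tool_a"] / n50["tool_b"]}

sentence = render("Tool A produced a contig N50 $ratio times longer than tool B.", results)
print(sentence)
```

If the table values change, rebuilding the document updates the sentence automatically.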
Some testimonials for Knitr
Authors (Wolfgang Huber)
“I do all my projects in Knitr. Having the textual
explanation, the associated code and the results all in one
place really increases productivity, and helps explaining
my analyses to colleagues, or even just to my future self.”
Reviewers (Christophe Pouzat)
“It took me a couple of hours to get the data, the few
custom developed routines, the “vignette” and to
REPRODUCE EXACTLY the analysis presented in the
manuscript. With few more hours, I was able to modify
the authors’ code to change their Fig. 4. In addition to
making the presented research trustworthy, the
reproducible research paradigm definitely makes the
reviewer’s job much more fun!”
Levels of reproducibility
Full reproducibility
Dynamic results
Usability (e.g. Galaxy Toolshed)
Rich metadata, documentation
Basic code/data Availability
What else can you do?
Give us data, papers & pipelines*
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
* APCs currently generously covered by BGI until 2015
www.gigasciencejournal.com
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Thanks to:
@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
Peter Li
Huayan Gao
Chris Hunter
Jesse Si Zhe
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Lancaster)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
CBIIT
Funding from:
Our collaborators:
team:
Case study:

Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I stopped worrying a learned to love open data (& methods)

  • 1. Overcoming the Reproducibility Crisis: and why I stopped worrying a learned to love open data (& methods) Scott Edmunds 1st July 2013
  • 2. Harnessing Data-Driven Intelligence Using networking power of the internet to tackle problems Can ask new questions & find hidden patterns & connections Build on each others efforts quicker & more efficiently More collaborations across more disciplines Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding Enables: Enabled by: Removing silos, open licenses, transparency, immediacy
  • 3. Dead trees not fit for purpose 18121665 1869
  • 4. The problems with publishing • Scholarly articles are merely advertisement of scholarship . The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995 • Lack of transparency, lack of credit for anything other than “regular” dead tree publication. • If there is interest in data, only to monetise & re-silo • Traditional publishing policies and practices a hindrance
  • 5. The consequences: growing replication gap 1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8) Out of 18 microarray papers, results from 10 could not be reproduced Out of 18 microarray papers, results from 10 could not be reproduced
  • 6. Consequences: increasing number of retractions >15X increase in last decade Strong correlation of “retraction index” with higher impact factor 1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 2. Retracted Science and the Retraction Index ▿ http://iai.asm.org/content/79/10/3855.abstract?
  • 7. Consequences: growing replication gap 1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950 More retractions: >15X increase in last decade At current % > by 2045 as many papers published as retracted Insufficient methods
  • 8. “Faked research is endemic in China” Global perceptions of Asian Research Million RMB rewards for high IF publications = ? 475, 267 (2011) New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full Nature 20th July 2011: http://www.nature.com/news/2011/110720/full/475267a.html
  • 9. “Faked research is endemic in China” Global perceptions of Asian Research Million RMB rewards for high IF publications = ? 475, 267 (2011) New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full Nature 20th July 2011: http://www.nature.com/news/2011/110720/full/475267a.html “Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.“ “There have been widespread complaints from scientists inside and outside China about this lack of transparency. ” “Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”
  • 10. STAP paper demonstrates problems: …to publish protocols BEFORE analysis …better access to supporting data …more transparent & accountable review …to publish replication studies Need:
  • 11. • Data • Software • Review • Re-use… = Credit } Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share. “ Nature Biotechnology 27, 579 (2009) New incentives/credit
  • 12. GigaSolution: deconstructing the paper www.gigadb.org www.gigasciencejournal.com Utilizes big-data infrastructure and expertise from: Combines and integrates: Open-access journal Data Publishing Platform Data Analysis Platform
  • 14. Validationchecks Fail – submitter is provided error report Pass – dataset is uploaded to GigaDB. Submission Workflow Curator makes dataset public (can be set as future date if required) DataCite XML file Excel submission file Submitter logs in to GigaDB website and uploads Excel submission GigaDB DOI assigned Files Submitter provides files by ftp or Aspera XML is generated and registered with DataCite Curator Review Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues). DOI 10.5524/100003 Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011) Public GigaDB dataset See: http://database.oxfordjournals.org/content/2014/bau018.abstract
  • 15. • Aspera = 10-100x faster up/download than FTP • Multi-omics/large scale biomedical data focus • Provide (ISA) curation & integration with other DBs (e.g. INSDC, MetaboLights, PRIDE, etc.) For more see: http://database.oxfordjournals.org/content/2014/bau018.abstract
  • 17. IRRI GALAXY Rice 3K project: 3,000 rice genomes, 13.4TB public data Beneficiaries/users of our work
  • 19. More transparency: open peer review BMC Series Medical Journals
  • 20. More transparency (and speed): pre-prints
  • 21. Real-time open-review = paper in arXiv + blogged reviews Reward open & transparent review http://tmblr.co/ZzXdssfOMJfywww.gigasciencejournal.com/content/2/1/10
  • 22. Real-time open-review = paper in arXiv + blogged reviews Reward open & transparent review
  • 23. GigaScience + Publons = further credit for reviewers efforts Reward open & transparent review
  • 24. Readers are interested in open review Next step to link to ORCID
  • 25. Cloud solutions? Reward better handling of metadata… Novel tools/formats for data interoperability/handling.
  • 26. Implement workflows in a community-accepted format http://galaxyproject.org Over 36,000 main Galaxy server users Over 1,000 papers citing Galaxy use Over 55 Galaxy servers deployed Open source Rewarding and aiding reproducibility
  • 30. How are we supporting data reproducibility? Data sets Analyses Linked to Linked to DOI DOI Open-Paper Open-Review DOI:10.1186/2047-217X-1-18 >26,000 accesses Open-Code 7 reviewers tested data in ftp server & named reports published DOI:10.5524/100044 Open-Pipelines Open-Workflows DOI:10.5524/100038 Open-Data 78GB CC0 data Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/>20,000 downloads Enabled code to being picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2
  • 31. 7 referees downloaded & tested data, then signed reports Reward open & transparent review
  • 32. Post publication: bloggers pull apart code/reviews in blogs + wiki: SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2 Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/ Reward open & transparent review
  • 33. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
  • 34. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk Implemented entire workflow in our Galaxy server, inc.: • 3 pre-processing steps • 4 SOAPdenovo modules • 1 post-processing step • Evaluation and visualization tools Also will be available to download by >36K Galaxy users in
  • 35. Can we reproduce results? SOAPdenovo2 S. aureus pipeline
  • 36. Taking a microscope to peer review
  • 37. The SOAPdenovo2 Case study Subject to and test with 3 models. Types of resources in an RO: Data; Method/Experimental protocol; Findings. Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL
  • 39. 1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that SOAPdenovo2 can be slightly slower than SOAPdenovo v1. 2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer. 3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer. 4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
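Corrections 2 and 3 above concern ratios of N50 values. For readers unfamiliar with the metric, here is a minimal sketch of the standard N50 definition and of the kind of fold-change figure that was misreported. The function names are illustrative assumptions, not code from SOAPdenovo2, and any lengths passed in would be toy values rather than the paper's data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def fold_change(new, old):
    """How many times longer `new` is than `old`."""
    return new / old
```

Computing the ratio directly from the table values, rather than typing it into the text by hand, is the sort of check that would have caught these discrepancies before publication.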
  • 40. Case Study: Lessons Learned • Most published research findings are false. Or at least have errors. • It is possible to push button(s) & recreate a result from a paper • On a semantic level (via nanopublications) can still have minor errors in text (interpretation not data) • Reproducibility is COSTLY. How much are you willing to spend? • Much easier to do this before rather than after publication
  • 41. Aiding reproducibility of imaging studies OMERO: providing access to imaging data View, filter, measure raw images with direct links from journal article. See all image data, not just cherry picked examples. Download and reprocess.
  • 44. Step 1: get your code out there • Put everything in a code repository. Even if it is ugly (see CRAPL) • Version control. Make sure you document exact version in the paper (big problem with lots of our papers). • If system environments are important, consider VMs http://matt.might.net/articles/crapl/
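One low-cost way to "document exact version in the paper" is to have the analysis write out its own provenance. The sketch below is an assumption about how that might be done, not a tool used by GigaScience: `git_commit` and `environment_record` are hypothetical helper names, and the package list is up to the author.

```python
import platform
import subprocess
import sys
from importlib import metadata


def git_commit(path="."):
    """Return the current git commit hash, or None if git or a repository is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=path, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def environment_record(packages=()):
    """Collect the version details a methods section should pin down."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "git_commit": git_commit(),
        "packages": versions,
    }
```

Saving the returned dictionary (for example as JSON) alongside each set of results gives reviewers the exact commit and library versions to check out; where the wider system environment matters, a VM image remains the more complete option.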
  • 45. Beyond Commenting Code: Step 2: Open lab books, dynamic documents • Facilitate reuse and sharing with tools like: Knitr, Sweave, iPython Notebook • Working towards executable papers…
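The core idea behind dynamic documents is that numbers in the text are computed from the results rather than typed by hand. The following is a minimal sketch of that idea in plain Python, not an example of Knitr, Sweave or the notebook format themselves; the `results` values and the `old_tool`/`new_tool` names are made up for illustration.

```python
from string import Template

# Hypothetical results table; in a real dynamic document these values
# would be read from the analysis output, not typed in.
results = {
    "old_tool": {"scaffold_n50": 2_000},
    "new_tool": {"scaffold_n50": 9_000},
}

TEXT = Template(
    "The scaffold N50 from the new tool is $fold times longer "
    "than from the old tool ($new vs $old bp)."
)


def render(results):
    """Fill the sentence with figures derived from the results table."""
    old = results["old_tool"]["scaffold_n50"]
    new = results["new_tool"]["scaffold_n50"]
    return TEXT.substitute(fold=f"{new / old:.1f}", new=f"{new:,}", old=f"{old:,}")
```

If the underlying results change, re-rendering updates the sentence, so the text cannot drift out of step with the tables.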
  • 46. E.g.
  • 47. E.g.
  • 48. Some testimonials for Knitr Authors (Wolfgang Huber) “I do all my projects in Knitr. Having the textual explanation, the associated code and the results all in one place really increases productivity, and helps explaining my analyses to colleagues, or even just to my future self.” Reviewers (Christophe Pouzat) “It took me a couple of hours to get the data, the few custom developed routines, the “vignette” and to REPRODUCE EXACTLY the analysis presented in the manuscript. With few more hours, I was able to modify the authors’ code to change their Fig. 4. In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewer’s job much more fun!”
  • 49. Levels of reproducibility: Full reproducibility; Dynamic results; Usability (e.g. Galaxy Toolshed); Rich metadata, documentation; Basic code/data availability
  • 50. What else can you do? Give us data, papers & pipelines* Contact us: scott@gigasciencejournal.com editorial@gigasciencejournal.com database@gigasciencejournal.com www.gigasciencejournal.com * APCs currently generously covered by BGI until 2015
  • 51. Ruibang Luo (BGI/HKU) Shaoguang Liang (BGI-SZ) Tin-Lap Lee (CUHK) Qiong Luo (HKUST) Senghong Wang (HKUST) Yan Zhou (HKUST) Thanks to: @gigascience facebook.com/GigaScience blogs.biomedcentral.com/gigablog/ Peter Li Huayan Gao Chris Hunter Jesse Si Zhe Nicole Nogoy Laurie Goodman Amye Kenall (BMC) Marco Roos (LUMC) Mark Thompson (LUMC) Jun Zhao (Lancaster) Susanna Sansone (Oxford) Philippe Rocca-Serra (Oxford) Alejandra Gonzalez-Beltran (Oxford) www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com CBIITFunding from: Our collaborators:team: Case study:

Editor's Notes

  • #27: Over 20,000 users on the main server Over 500 papers citing the use of Galaxy Over 55 servers deployed on the Web
  • #52: That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse, BGI for their support - specifically Shaoguang for IT and bioinformatics support – our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan, the Cogini web design team, Datacite for providing the DOI service and the isacommons team for their support and advocacy for best practice use of metadata reporting and sharing. Thank you for listening.