Reproducibility:
The Myths and Truths of “Push-
Button” Bioinformatics
Simon Cockell

Bioinformatics Special Interest Group
19th July 2012
Repeatability and Reproducibility
• Main principle of the scientific method
• Repeatability is 'within lab'
• Reproducibility is 'between lab'
  • Broader concept
• This should be easy in bioinformatics, right?
  • Same data + same code = same results
  • Not many analyses have stochasticity

                              http://xkcd.com/242/
Same data?
• Example
  • Data deposited in SRA
  • Original data deleted by researchers
  • .sra files are NOT .fastq
  • All filtering/QC steps lost
  • Starting point for subsequent analysis not the same – regardless of whether same code used
Same data?
• Data files are very large
• Hardware failures are surprisingly common
• Not all hardware failures are catastrophic
  • Bit-flipping by faulty RAM
• Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
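A minimal sketch of that safeguard using standard GNU coreutils (the file name here is an invented placeholder for your real raw data):

```shell
# Create a small stand-in data file (in real life: your raw .fastq files)
mkdir -p data
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > data/sample.fastq

# Record checksums BEFORE the data leaves the machine it was generated on
md5sum data/*.fastq > checksums.md5

# After any transfer (scp, USB disk, cluster staging), verify the copies.
# md5sum -c exits non-zero on any mismatch, so it can gate a pipeline.
md5sum -c checksums.md5
```

`md5sum -c` prints one `<file>: OK` line per intact file; any mismatch makes it exit non-zero, which a `set -e` script will catch.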
Same code?
• What version of each piece of software did you use?
• Is it still available?
• Did you write it yourself?
• Do you use version
  control?
• Did you tag a version?
• Is the software
  closed/proprietary?
Version Control
• Good practice for software AND data
• DVCS means it doesn't have to be in a remote repository
• All local folders can be versioned
  • Doesn't mean they have to be, it's a judgment call
• Check in regularly
• Tag important “releases”
                       https://twitter.com/sjcockell/status/202041359920676864
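A hedged sketch of that workflow with git (the folder, file, and tag names are invented for illustration):

```shell
# Any local folder can become a repository -- no remote server needed
git init my-analysis
cd my-analysis
git config user.name "Analyst"              # identity needed once per repo
git config user.email "analyst@example.org"

# Version the code AND a checksum manifest of the data it ran on
echo 'bwa mem ref.fa reads.fq > aln.sam' > run.sh
md5sum run.sh > manifest.md5
git add run.sh manifest.md5
git commit -m "Alignment step, first working version"

# Tag the exact state used for a result, so it can be recovered later
git tag -a paper-v1 -m "State used for the submitted manuscript"
git tag                                      # lists the tag: paper-v1
```

Checking out `paper-v1` later restores exactly the scripts that produced the published result, which is the point of tagging "releases".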
Pipelines
• Package your analysis
• Easily repeatable
• Also easy to distribute
• Start-to-finish task
  automation
• Process captured by
  underlying pipeline
  architecture


              http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
Tools for pipelining analyses
• Huge numbers
  • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
  • Only a few widely used:
• Bash
  • old school
• Taverna
  • build workflows from public webservices
• Galaxy
  • sequencing focus – tools provided in 'toolshed'
• Microbase
  • distributed computing, build workflows from 'responders'
• e-Science Central
  • 'Science as a Service' – cloud focus
  • not specifically a bioinformatics tool
Bash
• Single-machine (or cluster) command-line workflows
• No fancy GUIs
• Record provenance & process
• Rudimentary parallel processing



                          http://www.gnu.org/software/bash/
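A toy skeleton of such a workflow — the sample names and the "processing" step are placeholders for real tool invocations:

```shell
#!/bin/bash
set -euo pipefail             # stop on the first error or unset variable

log() { echo "$(date) $*" >> pipeline.log; }   # crude provenance trail

log "pipeline started"

# Rudimentary parallelism: run per-sample jobs in the background, then wait
for sample in A B C; do
  {
    # placeholder for a real step (aligner, QC tool, etc.)
    echo "processed sample ${sample}" > "result_${sample}.txt"
  } &
done
wait                          # block until every background job has finished

log "pipeline finished"
cat result_*.txt
```

`set -euo pipefail` stops the script at the first failure instead of silently carrying bad intermediates downstream, and the log file records when each stage ran — simple provenance with no extra tooling.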
Taverna
• Workflows from web services
• Lack of relevant services
  • Relies on providers
• Gluing services together increasingly problematic
• Sharing workflows through myExperiment
  • http://www.myexperiment.org/




                                  http://www.taverna.org.uk/
Galaxy
• “open, web-based platform for data intensive biomedical research”
• Install or use (limited) public server
• Can build workflows from tools in 'toolshed'
• Command-line tools wrapped with web interface
                             https://main.g2.bx.psu.edu/
Galaxy Workflow
Microbase
• Task management framework
• Workflows emerge from interacting 'responders'
• Notification system passes messages around
• 'Cloud-ready' system that scales easily
• Responders must be written for new tools




                                    http://www.microbasecloud.com/
e-Science Central
• 'Blocks' can be combined into workflows
• Blocks need to be written by an expert
• Social networking features
• Good provenance recording

                         http://www.esciencecentral.co.uk/
The best approach?
• Good for individual analysis
  • Package & publish
• All datasets different
  • One size does not fit all
  • Downstream processes often depend on results of upstream ones
• Note lack of QC
  • Requires human interaction – impossible to pipeline
  • Different every time
  • Subjective – major source of variation in results
  • BUT – important and necessary (GIGO: garbage in, garbage out)
More tools for reproducibility
• iPython notebook
  • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  • Build notebooks with code embedded
  • Run code arbitrarily
  • Example: https://pilgrims.ncl.ac.uk:9999/
• Runmycode.org
  • Allows researchers to create 'companion websites' for papers
  • The site lets readers run the methodology described in the paper
  • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
The executable paper
• The ultimate in repeatable research
• Data and code embedded in the publication
• Figures can be generated, in situ, from the actual data
• http://ged.msu.edu/papers/2012-diginorm/
Summary
• For work to be repeatable:
  • Data and code must be available
  • Process must be documented (and preferably shared)
  • Version information is important
  • Pipelines are not the great panacea
    • Though they may help for parts of the process
    • Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
Inspirations for this talk
• C. Titus Brown's blogposts on repeatability and the executable paper
  • http://ivory.idyll.org/blog
• Michael Barton's blogposts about organising bioinformatics projects and pipelines
  • http://bioinformaticszen.com/


Editor's Notes

  • #10: Or any other Unix shell (maybe even Windows batch scripts)
  • #13: Expose often simplified set of options
  • #16: Can make arbitrary blocks in R, Java or Octave