Reproducibility:
The Myths and Truths of “Push-
Button” Bioinformatics
Simon Cockell

Bioinformatics Special Interest Group
19th July 2012
Repeatability and Reproducibility
• Main principle of the scientific method
• Repeatability is 'within lab'
• Reproducibility is 'between lab'
  • Broader concept
• This should be easy in bioinformatics, right?
  • Same data + same code = same results
  • Not many analyses have stochasticity

                              http://xkcd.com/242/
Same data?
• Example
  • Data deposited in SRA
  • Original data deleted by researchers
  • .sra files are NOT .fastq
  • All filtering/QC steps lost
  • Starting point for subsequent analysis not the same – regardless of whether same code used
Same data?
• Data files are very large
• Hardware failures are surprisingly common
• Not all hardware failures are catastrophic
  • Bit-flipping by faulty RAM
• Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
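A minimal sketch of that safeguard using standard GNU coreutils (the file name here is an invented placeholder for your real raw data):

```shell
# Create a small stand-in data file (in real life: your raw .fastq files)
mkdir -p data
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > data/sample.fastq

# Record checksums BEFORE the data leaves the machine it was generated on
md5sum data/*.fastq > checksums.md5

# After any transfer (scp, USB disk, cluster staging), verify the copies.
# md5sum -c exits non-zero on any mismatch, so it can gate a pipeline.
md5sum -c checksums.md5
```

`md5sum -c` prints one `<file>: OK` line per intact file; any mismatch makes it exit non-zero, which a `set -e` script will catch.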
Same code?
• What version of each piece of software did you use?
• Is it still available?
• Did you write it yourself?
• Do you use version
  control?
• Did you tag a version?
• Is the software
  closed/proprietary?
Version Control
• Good practice for software AND data
• DVCS means it doesn't have to be in a remote repository
• All local folders can be versioned
  • Doesn't mean they have to be, it's a judgment call
• Check in regularly
• Tag important “releases”
                       https://twitter.com/sjcockell/status/202041359920676864
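A hedged sketch of that workflow with git (the folder, file, and tag names are invented for illustration):

```shell
# Any local folder can become a repository -- no remote server needed
git init my-analysis
cd my-analysis
git config user.name "Analyst"              # identity needed once per repo
git config user.email "analyst@example.org"

# Version the code AND a checksum manifest of the data it ran on
echo 'bwa mem ref.fa reads.fq > aln.sam' > run.sh
md5sum run.sh > manifest.md5
git add run.sh manifest.md5
git commit -m "Alignment step, first working version"

# Tag the exact state used for a result, so it can be recovered later
git tag -a paper-v1 -m "State used for the submitted manuscript"
git tag                                      # lists the tag: paper-v1
```

Checking out `paper-v1` later restores exactly the scripts that produced the published result, which is the point of tagging "releases".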
Pipelines
• Package your analysis
• Easily repeatable
• Also easy to distribute
• Start-to-finish task
  automation
• Process captured by
  underlying pipeline
  architecture


              http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
Tools for pipelining analyses
• Huge numbers
  • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
  • Only a few widely used:
• Bash
  • old school
• Taverna
  • build workflows from public webservices
• Galaxy
  • sequencing focus – tools provided in 'toolshed'
• Microbase
  • distributed computing, build workflows from 'responders'
• e-Science Central
  • 'Science as a Service' – cloud focus
  • not specifically a bioinformatics tool
Bash
• Single-machine (or cluster) command-line workflows
• No fancy GUIs
• Record provenance & process
• Rudimentary parallel processing



                          http://www.gnu.org/software/bash/
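A toy skeleton of such a workflow — the sample names and the "processing" step are placeholders for real tool invocations:

```shell
#!/bin/bash
set -euo pipefail             # stop on the first error or unset variable

log() { echo "$(date) $*" >> pipeline.log; }   # crude provenance trail

log "pipeline started"

# Rudimentary parallelism: run per-sample jobs in the background, then wait
for sample in A B C; do
  {
    # placeholder for a real step (aligner, QC tool, etc.)
    echo "processed sample ${sample}" > "result_${sample}.txt"
  } &
done
wait                          # block until every background job has finished

log "pipeline finished"
cat result_*.txt
```

`set -euo pipefail` stops the script at the first failure instead of silently carrying bad intermediates downstream, and the log file records when each stage ran — simple provenance with no extra tooling.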
Taverna
• Workflows from web services
• Lack of relevant services
  • Relies on providers
• Gluing services together increasingly problematic
• Sharing workflows through myExperiment
  • http://www.myexperiment.org/




                                  http://www.taverna.org.uk/
Galaxy
• “open, web-based platform for data intensive biomedical research”
• Install or use (limited) public server
• Can build workflows from tools in 'toolshed'
• Command-line tools wrapped with web interface
                             https://main.g2.bx.psu.edu/
Galaxy Workflow
Microbase
• Task management framework
• Workflows emerge from interacting 'responders'
• Notification system passes messages around
• 'Cloud-ready' system that scales easily
• Responders must be written for new tools




                                    http://www.microbasecloud.com/
e-Science Central
• 'Blocks' can be combined into workflows
• Blocks need to be written by an expert
• Social networking features
• Good provenance recording

                         http://www.esciencecentral.co.uk/
The best approach?
• Good for individual analysis
  • Package & publish
• All datasets different
  • One size does not fit all
  • Downstream processes often depend on results of upstream ones
• Note lack of QC
  • Requires human interaction – impossible to pipeline
  • Different every time
  • Subjective – major source of variation in results
  • BUT – important and necessary (GIGO: garbage in, garbage out)
More tools for reproducibility
• iPython notebook
  • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  • Build notebooks with code embedded
  • Run code arbitrarily
  • Example: https://pilgrims.ncl.ac.uk:9999/
• Runmycode.org
  • Allows researchers to create 'companion websites' for papers
  • The site lets readers run the methodology described in the paper
  • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
The executable paper
• The ultimate in repeatable research
• Data and code embedded in the publication
• Figures can be generated, in situ, from the actual data
• http://ged.msu.edu/papers/2012-diginorm/
Summary
• For work to be repeatable:
  • Data and code must be available
  • Process must be documented (and preferably shared)
  • Version information is important
  • Pipelines are not the great panacea
    • Though they may help for parts of the process
    • Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
Inspirations for this talk
• C. Titus Brown's blogposts on repeatability and the executable paper
  • http://ivory.idyll.org/blog
• Michael Barton's blogposts about organising bioinformatics projects and pipelines
  • http://bioinformaticszen.com/


Editor's Notes

  • #10: Or any other Unix shell (maybe even Windows batch scripts)
  • #13: Expose often simplified set of options
  • #16: Can make arbitrary blocks in R, Java or Octave