Reproducible bioinformatics pipelines with Docker and Anduril

Christian Frech, PhD
Bioinformatician at Children’s Cancer Research Institute, Vienna
CeMM Special Seminar
September 25th, 2015

Why care about reproducible pipelines in bioinformatics?
• For your (future) self
  • Quickly re-run analyses with different parameters/tools
  • Best documentation of how results were produced
• For others
  • Allow others to easily reproduce your findings (“reproducibility crisis”)*
  • Code re-use between projects and colleagues
*) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998

Obstacles to computational reproducibility
• Software/script not available (even upon request)
• Black box: Code (or even virtual machine) available, but no documentation on how to run it
• Dependency hell: Software and documentation available, but (too) difficult to get running
• Code rot: Code breaks over time due to software updates
• 404 Not Found: Unstable URLs, e.g. links to lab homepages
Go figure…

Computational pipelines to the rescue
• In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs
  • Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script
• Simple automation via (Bash/R/Python/Perl) scripting has its limitations
  • No built-in error checking
  • No partial execution
  • No built-in parallelization
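To make two of the limitations above concrete, here is a minimal Python sketch of what a pipeline framework adds on top of a plain script: it checks exit codes (error checking) and skips a step whose output is already newer than its inputs (partial execution). The function names `is_up_to_date` and `run_step` and the toy "uppercase" step are hypothetical illustrations, not part of any framework mentioned in this talk.

```python
import os
import subprocess
import sys
import tempfile


def is_up_to_date(inputs, output):
    """Return True if output exists and is at least as new as every input."""
    if not os.path.exists(output):
        return False
    out_mtime = os.path.getmtime(output)
    return all(os.path.getmtime(path) <= out_mtime for path in inputs)


def run_step(name, argv, inputs, output):
    """Run one pipeline step unless its output is current; fail loudly on error."""
    if is_up_to_date(inputs, output):
        return "skipped"
    result = subprocess.run(argv)
    if result.returncode != 0:
        raise RuntimeError(f"step {name!r} failed with exit code {result.returncode}")
    if not os.path.exists(output):
        raise RuntimeError(f"step {name!r} did not produce {output}")
    return "ran"


def demo():
    """Run a toy step twice; the second call should be skipped."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "reads.txt")
        dst = os.path.join(tmp, "upper.txt")
        with open(src, "w") as fh:
            fh.write("acgt\n")
        # Toy stand-in for a real tool: copy the input in upper case.
        code = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        argv = [sys.executable, "-c", code, src, dst]
        first = run_step("uppercase", argv, [src], dst)
        second = run_step("uppercase", argv, [src], dst)
        with open(dst) as fh:
            content = fh.read()
        return first, second, content


if __name__ == "__main__":
    print(demo())
```

Real frameworks layer dependency graphs and parallel scheduling on top of this basic idea; the sketch deliberately leaves those out.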

No shortage of pipeline frameworks
• Script-based
  • GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
• GUI-based
  • Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
• Various commercial solutions for more standardized workflows (e.g. RNA-seq)
• Geared toward biologists without programming skills (“point-and-click”)
See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/

Personal wish list for pipeline framework
• Script-based (maximum flexibility, minimum overhead)
• Powerful scripting language
• Cluster integration (preferably via Slurm)
• Modular (allow code re-use between projects and colleagues)
• Component library for frequent tasks (e.g. join two CSV files)
• Reporting (HTML, PDF) to share results
• Free & open-source
• Bundle scripts/data with execution environment

What’s wrong with good ol’ GNU make?
PRO
• Available on all Linux platforms
• Stood the test of time (developed in the 1970s)
• Rapid development (Bash scripting + target rules)
• Parallel job execution (-j parameter)
CON
• No cluster support
• Arcane syntax, cryptic pattern rules
• Half-baked multi-output rules
• No type checking (everything is a generic file)
• Difficult to modularize (code re-use)
• Rebuild not triggered by recipe change
• No reporting
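The point about recipe changes can be illustrated with a small Python sketch: besides checking that the target exists, a build tool can store a hash of the recipe text and rebuild when that hash changes. This is a generic illustration, not how GNU make or Anduril work internally; the names `needs_rebuild`, `record_build`, and the JSON state file are assumptions made for the example.

```python
import hashlib
import json
import os
import tempfile


def recipe_digest(recipe):
    """Hash the recipe text so that any edit to it changes the digest."""
    return hashlib.sha256(recipe.encode("utf-8")).hexdigest()


def needs_rebuild(target, recipe, state_file):
    """Rebuild if the target is missing or the recipe differs from the last build."""
    if not os.path.exists(target):
        return True
    try:
        with open(state_file) as fh:
            state = json.load(fh)
    except FileNotFoundError:
        return True
    return state.get(target) != recipe_digest(recipe)


def record_build(target, recipe, state_file):
    """Remember which recipe produced the target."""
    state = {}
    if os.path.exists(state_file):
        with open(state_file) as fh:
            state = json.load(fh)
    state[target] = recipe_digest(recipe)
    with open(state_file, "w") as fh:
        json.dump(state, fh)
```

With timestamps alone, editing a recipe from `sort` to `sort -u` would leave a stale target in place; with the stored digest, `needs_rebuild` returns True for the edited recipe.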

Anduril
http://www.anduril.org

Anduril
• Developed since 2008 at Biomedicum Systems Biology Laboratory, Helsinki, Finland
  • http://research.med.helsinki.fi/gsb/hautaniemi/
• Built for scientific data analysis with focus on bioinformatics
• Proprietary workflow scripting language “Anduril script”
  • Possibility to embed native code (Bash/R/Python/Perl)
  • Version 2 will switch to Scala
• Open source & free
  • Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
• No widespread adoption (yet?)

Anduril features
• Script-based (maximum flexibility, less overhead)
• Expressive scripting language
• Modular to allow code re-use (between projects and colleagues)
• Ready-made component library for frequent analysis steps

Example workflow: RNA-seq alignment with GSNAP
Anduril script:
    // collect the BAM files found in the input directory
    inputBamDir = INPUT(path="/data/bam", recursive=false)
    inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
    alignedBams = record()
    // run one GSNAP instance per input file
    for bam : std.iterArray(inputBamFiles) {
        gsnap = GSNAP (
            reads = INPUT(path=bam.file),
            options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
            @cpu = 10,
            @memory = 40000,
            @name = "gsnap_" + bam.key
        )
        alignedBams[bam.key] = gsnap.alignment
    }
Execute with (distributed execution on cluster):
    $ anduril run workflow.and --exec-mode slurm
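The `filePattern` above contains one capture group, and the loop later refers to `bam.key`. Assuming the captured text is what becomes the key (an inference from the script, not something the slides state), the following Python sketch shows what that regular expression extracts. The file names are made up for illustration.

```python
import re

# Same pattern as in the workflow; the group ([^_]+) captures the sample label.
FILE_PATTERN = re.compile(r"C57C3ACXX_CV_([^_]+)_.*[.]bam$")


def keys_from_filenames(filenames):
    """Map captured sample label -> file name for every matching file."""
    matched = {}
    for name in filenames:
        m = FILE_PATTERN.search(name)
        if m:
            matched[m.group(1)] = name
    return matched


if __name__ == "__main__":
    # Hypothetical directory listing.
    listing = [
        "C57C3ACXX_CV_S1_lane3.bam",
        "C57C3ACXX_CV_S2_lane3.bam",
        "notes.txt",
    ]
    print(keys_from_filenames(listing))
```

Files that do not match the pattern (here `notes.txt`) are simply ignored, so the pattern doubles as a filter.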

Embedding native R code in Anduril script
Convert UCSC to Ensembl chromosome names in a CSV file containing column ‘chrom’:
    ensembl = REvaluate(
        table1 = ucsc,
        script = StringInput(content=
            '''
            table.out <- table1
            table.out$chrom <- gsub("^chr", "", table.out$chrom)
            '''
        )
    )
Also supports inlining of Bash, Python, Java, and Perl scripts
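For readers more at home in Python, the same transformation can be sketched outside Anduril. This is a standalone illustration, not an Anduril component; the function name `ucsc_to_ensembl` and the comma delimiter are assumptions, so pass a different `delimiter` if your table is tab-separated.

```python
import csv
import io
import re


def ucsc_to_ensembl(csv_text, column="chrom", delimiter=","):
    """Strip a leading 'chr' from one column, mirroring gsub("^chr", "", ...)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[column] = re.sub(r"^chr", "", row[column])
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(ucsc_to_ensembl("chrom,pos\nchr1,100\nchrX,5\n"))
```

Like the R snippet, this only strips the prefix; the mitochondrial chromosome `chrM` would become `M` rather than Ensembl's `MT`, so that case needs separate handling if it occurs in your data.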


Docker
• “Lightweight” virtualization technology for Unix-based systems
• Processes run in isolated namespaces (“containers”), but share the same kernel
• Like VMs: containers are portable between systems -> reproducibility!
• Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization
[Figure: VM vs. container]

How to bundle workflow with execution environment?
Solution 1: one container holding Anduril, the workflow, and Components 1, 2, and 3
• Pro: Single container, easy to maintain
• Con: VM-like approach; huge, monolithic container, difficult to share (against Docker philosophy)
Solution 2: Anduril and the workflow, with one container per component (Container A: Component 1; Container B: Component 2; Container C: Component 3)
• Pro: Completely modularized, easy to re-use/share workflow components
• Con: “container hell”?

Hybrid solution
[Figure labels: Master container (Workflow, Anduril); Container A; Component 1, Component 2, Component 3; “Docker inside Docker”]
• Project- and user-specific components installed in master container
• Shared components installed in common container (e.g. container “RNA-seq”)
• Pro: Workflow completely containerized (= portable); only shared components in common containers
• Con: Still some (but greatly reduced) overhead for container maintenance

Dockerized GSNAP in Anduril
    inputBamDir = INPUT(path="/data/bam", recursive=false)
    inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
    alignedBams = record()
    for bam : std.iterArray(inputBamFiles) {
        gsnap = GSNAP (
            reads = INPUT(path=bam.file),
            options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
            docker = "cfrech/anduril-gsnap-2015-09-21",
            @cpu = 10,
            @memory = 40000,
            @name = "gsnap_" + bam.key
        )
        alignedBams[bam.key] = gsnap.alignment
    }
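The slides do not show how the `docker` parameter is implemented. One plausible reading is that the component's command line gets wrapped in a `docker run` call against the named image. The Python sketch below only builds such a command line as a list of arguments; it does not execute anything. The helper names, the `/data` mount point, and the `gsnap` invocation are assumptions for illustration, while `docker run`, `--rm`, and `-v` are standard Docker CLI usage.

```python
import shlex


def docker_run_argv(image, tool_argv, host_dir, container_dir="/data"):
    """Wrap a tool command line in 'docker run', mounting one host directory."""
    volume = f"{host_dir}:{container_dir}"
    # --rm removes the container after the tool exits; -v shares the data directory.
    return ["docker", "run", "--rm", "-v", volume, image, *tool_argv]


def as_shell_line(argv):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(argv)


if __name__ == "__main__":
    argv = docker_run_argv(
        image="cfrech/anduril-gsnap-2015-09-21",
        tool_argv=["gsnap", "--npaths=1", "--max-mismatches=1", "--novelsplicing=0"],
        host_dir="/data/bam",
    )
    print(as_shell_line(argv))
```

Building the command as a list rather than a string avoids quoting problems when paths contain spaces, and it can be handed directly to `subprocess.run` on a machine where Docker is installed.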

So, Anduril is great… but
• Proprietary scripting language
  • Biggest hurdle for widespread adoption IMO
  • Will likely improve with version 2 (which uses Scala)
• Documentation opaque for beginners
  • WANTED: Simple step-by-step guide to build your first Anduril workflow
• High upfront investment to get going (because of the above)
• In-lining Bash/R/Perl/Python should be simpler
  • Currently too much clutter when using “BashEvaluate” and the like
• Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
  • Will improve with fluency in workflow scripting language

RNA-seq case study
Step 1: Configure Anduril workflow
    title = "My project long title"
    shortName = "My project short title"
    authors = "Christian Frech"
    // analyses to run
    runNetworkAnalysis = true
    runMutationAnalysis = true
    runGSEA = true
    // constants
    PROJECT_BASE="/mnt/projects/myproject"
    gtf = INPUT(path=PROJECT_BASE+"/data/Homo_sapiens.GRCh37.75.etv6runx1.gtf.gz")
    referenceGenomeFasta = INPUT(path="/data/reference/human_g1k_v37.fasta")
    ...
+ description of samples, sample groups, and group comparisons in external CSV file
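The layout of the external CSV file is not shown in the slides. As a rough idea of what such a sample sheet could look like and how a workflow might read it, here is a Python sketch; the column names `sample` and `group` and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def samples_by_group(csv_text):
    """Collect sample names per group from a sample sheet with 'sample' and 'group' columns."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(row["sample"])
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample sheet.
    sheet = "sample,group\nS1,treated\nS2,treated\nS3,control\n"
    print(samples_by_group(sheet))
```

Keeping sample metadata in a table outside the workflow script means that adding a sample or a comparison changes data, not code.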

RNA-seq case study
Step 2: Run Anduril workflow on cluster
$ anduril run main.and --exec-mode slurm

RNA-seq case study
Step 3: Go for lunch

RNA-seq case study
Step 4: Study PDF report

What follows are screenshots from this PDF report

QC: Distribution of expression values per sample

Volcano plot for each comparison

Table report of DEGs for each comparison

Expression values of top differentially expressed genes per comparison

GO term enrichment for each comparison

Interaction network of DEGs for each comparison

Chromosomal distribution of DEGs

GSEA heat map summarizing all comparisons
Rows = enriched gene sets
Columns = comparisons
Value = normalized enrichment score (NES)
Red = enriched for up-regulated genes
Blue = enriched for down-regulated genes
* = significant (FDR < 0.05)
** = highly significant (FDR < 0.01)
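The star annotation in the legend above is a simple threshold rule on the FDR. A minimal Python sketch of that rule, using the two cut-offs from the legend (the function name `significance_marker` is made up for this example):

```python
def significance_marker(fdr):
    """Return the heat map annotation for a given FDR, per the legend thresholds."""
    if fdr < 0.01:
        return "**"  # highly significant
    if fdr < 0.05:
        return "*"   # significant
    return ""        # not significant, no annotation


if __name__ == "__main__":
    for value in (0.001, 0.03, 0.2):
        print(value, repr(significance_marker(value)))
```

The stricter threshold is tested first so that a value below 0.01 gets two stars rather than one.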

Future developments
• Push new Anduril components to public repository (needs some refactoring, documentation, test cases)
• Help on Anduril2 manuscript
• Port custom Makefiles to Anduril (ongoing)
• Cloud deployment of dockerized workflow
  • Couple Slurm to AWS EC2
  • Automatic spin-up of Docker-enabled AMIs serving as computing nodes

In the (not so) distant future …
$ docker pull cfrech/frech2015_et_al
$ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output
$ evince output/figure1.pdf

Further reading
• Discussion thread on Docker & Anduril
  https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw

Acknowledgement
• Marko Laakso (Significo)
• Sirku Kaarinen (Significo)
• Kristian Ovaska (Valuemotive)
• Pekka Lehti (Valuemotive)
• Ville Rantanen (University of Helsinki, Hautaniemi lab)
• Nuno Andrade (CCRI)
• Andreas Heitger (CCRI)
