flowr: streamlining computing workflows
speed up analysis using a computing cluster
why bother?

➡ when one needs to wrangle a lot of data
➡ and there are multiple steps involved
➡ esp. when some of the steps can be further broken down and processed in parallel
➡ use a computing cluster, submit a web of jobs

✓ Effectively process a multi-step pipeline, spawning it across the computing cluster
✓ Reproducible and transparent, with cleanly structured execution logs
✓ Track and re-run flows
✓ Lean and portable, with easy installation
✓ Run the same pipeline in the cloud (using StarCluster) OR on a local machine
✓ Supports multiple cluster computing platforms (torque, lsf, sge, slurm, …)
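Getting started is a one-liner from CRAN; a sketch, assuming the released package (setup(), which links the command-line helper, is taken from the package docs and is optional):

```r
install.packages("flowr")  # released version from CRAN

library(flowr)
setup()  # optional: enables the `flowr` command-line interface
```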
five simple terms, defining all relationships

submission types (decide how pieces of a single step are processed):
➡ scatter: all in parallel
➡ serial: sequentially

dependency types (decide the relationship b/w steps):
➡ serial
➡ gather
➡ burst

A step's submission type combined with its dependency type defines its relationship to the previous step: many-to-many, many-to-one, or one-to-many.
Using a genomics example flow, with flowr concepts

➡ align each lane: fastq → sam → bam, five lanes side by side
  (submission: scatter, dependency: serial; a many-to-many relationship)
➡ combine the lane-level bams into a single merged bam
  (submission: serial, dependency: gather; a many-to-one relationship)
➡ fan out from the merged bam into alignment stats, sort & index, and
  downstream analysis: Mutations, Copy Number variation, Indel Calling
  (submission: scatter, dependency: burst; a one-to-many relationship)
a simple pipeline, where
★ we would sleep for a few seconds
★ create a few small files
★ merge those files
★ get the size of the resulting merged file
simple pipeline in bash

echo 'Hello World !'     # say Hello to the world
sleep 5
sleep 5                  # wait for a few seconds…
echo $RANDOM > tmp1
echo $RANDOM > tmp2      # create two small files
cat tmp1 tmp2 > tmp      # merge the two files
du -sh tmp               # check the size of the resulting file
wrap bash commands into R

hello = 'echo Hello World !'          # say Hello to the world
sleep = c('sleep 5', 'sleep 5')       # wait for a few seconds…
tmp   = c('echo $RANDOM > tmp1',
          'echo $RANDOM > tmp2')      # create two small files
merge = 'cat tmp1 tmp2 > tmp'         # merge the two files
size  = 'du -sh tmp'                  # check the size of the resulting file
create a table of all commands

library(flowr)

# create a named list
lst = list(hello = hello,
           sleep = sleep,
           tmp   = tmp,
           merge = merge,
           size  = size)

# create a table
flowmat = to_flowmat(lst, "samp1")

a simple tab-delim table:

|samplename |jobname |cmd                 |
|:----------|:-------|:-------------------|
|samp1      |hello   |echo Hello World !  |
|samp1      |sleep   |sleep 5             |
|samp1      |sleep   |sleep 5             |
|samp1      |tmp     |echo $RANDOM > tmp1 |
|samp1      |tmp     |echo $RANDOM > tmp2 |
|samp1      |merge   |cat tmp1 tmp2 > tmp |
|samp1      |size    |du -sh tmp          |
connect the dots…
the flow definition decides the sequence of steps
create a flow definition

flowdef = to_flowdef(flowmat,
    sub_type = c("serial", "scatter", "scatter", "serial", "serial"),
    dep_type = c("none", "burst", "serial", "gather", "serial"),
    platform = "local")

a simple tab-delim table:

|jobname |sub_type |prev_jobs |dep_type | cpu|
|:-------|:--------|:---------|:--------|---:|
|hello   |serial   |none      |none     |   1|
|sleep   |scatter  |hello     |burst    |   1|
|tmp     |scatter  |sleep     |serial   |   1|
|merge   |serial   |tmp       |gather   |   1|
|size    |serial   |merge     |serial   |   1|

plot_flow(flowdef)
[flow chart: hello (dep: none, sub: serial) → sleep (dep: burst, sub: scatter) → tmp (dep: serial, sub: scatter) → merge (dep: gather, sub: serial) → size (dep: serial, sub: serial)]
stitch a flow…

use the flowmat and the flowdef to create a flow object, then stitch & submit to the cluster (cloud or server):

fobj = to_flow(flowmat, flowdef, execute = TRUE)

Working on: hello
|=====                             | 25%
Working on: sleep
|================                  | 50%
Working on: merge
|==================================| 100%
Working on: size
Flow is being processed. Track it from R/Terminal using:
flowr status x=~/flowr/runs/flowname-samp1-20151005-16-01-38-M8WniKJo
OR from R using:
status(x='~/flowr/runs/flowname-samp1-20151005-16-01-38-M8WniKJo')
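The same flow can target a scheduler instead of the local machine by switching the platform; a sketch, assuming one of the platforms named earlier (queue and resource defaults come from flowr's configuration):

```r
library(flowr)

# identical definition, but jobs go through the cluster scheduler
flowdef = to_flowdef(flowmat,
    sub_type = c("serial", "scatter", "scatter", "serial", "serial"),
    dep_type = c("none", "burst", "serial", "gather", "serial"),
    platform = "lsf")  # or "torque", "sge", "slurm"

fobj = to_flow(flowmat, flowdef, execute = TRUE)
```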
submit a flow, then…

status(): monitor the status of a single flow OR multiple flows
kill(): kill all the associated jobs of one or many flows
rerun(): rerun the flow from an intermediate step
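A minimal sketch of these three calls; the run directory is the one printed at submission time, and the start_from argument to rerun() is an assumption based on the package docs:

```r
library(flowr)

# run directory printed when the flow was submitted
wd = "~/flowr/runs/flowname-samp1-20151005-16-01-38-M8WniKJo"

status(x = wd)  # one flow; a wildcard such as "~/flowr/runs/flowname-*"
                # summarizes multiple flows at once
kill(x = wd)    # kill all jobs associated with this flow

# redo the flow from an intermediate step onwards
rerun(x = wd, start_from = "merge")
```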
github.com/sahilseth/flowr
complete documentation: docs.flowr.space
email: sahil.seth@me.com
Extra details

Flow mat
- use any language to create a flow mat (a tsv file)
- the cmd column defines the commands to run

|samplename |jobname |cmd                                |
|:----------|:-------|:----------------------------------|
|sample1    |A       |sleep 2 && sleep 5; echo hello     |
|sample1    |A       |sleep 13 && sleep 7; echo hello    |
|sample1    |B       |head -c 100000 /dev/urandom > tmp1 |
|sample1    |B       |head -c 100000 /dev/urandom > tmp1 |
|sample1    |C       |cat tmp1 tmp2 tmp3 > merged        |
|sample1    |D       |du -sh merged                      |
|sample1    |D       |ls merged                          |

Flow Definition
- creatively define relationships using submission and dependency types
- each row describes resources for one step, providing full flexibility

|jobname |submission type |previous job(s) |dependency type |queue  | memory|time  | cpu|platform |
|:-------|:---------------|:---------------|:---------------|:------|------:|:-----|---:|:--------|
|A       |scatter         |none            |none            |medium | 163185|23:00 |   1|lsf      |
|B       |scatter         |A               |serial          |medium | 163185|23:00 |   1|lsf      |
|C       |serial          |B               |gather          |medium | 163185|23:00 |   1|lsf      |
|D       |scatter         |C               |burst           |medium | 163185|23:00 |   1|lsf      |
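Since a flow mat is just a plain tab-delimited file, it can be generated without R at all; a minimal sketch from the shell (the file name flowmat.tsv and the rows in it are illustrative):

```shell
# write the header flowr expects, then one row per command
printf 'samplename\tjobname\tcmd\n'                       >  flowmat.tsv
printf 'sample1\tA\tsleep 2 && sleep 5; echo hello\n'     >> flowmat.tsv
printf 'sample1\tB\thead -c 100000 /dev/urandom > tmp1\n' >> flowmat.tsv
printf 'sample1\tC\tcat tmp1 > merged\n'                  >> flowmat.tsv

cat flowmat.tsv   # a valid flow mat, ready to hand to flowr
```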

More Related Content

PDF
Faster PHP apps using Queues and Workers
KEY
Gearman and CodeIgniter
PPT
Gearman - Job Queue
PDF
Queue your work
PDF
Distributed Queue System using Gearman
PDF
Packaging is the Worst Way to Distribute Software, Except for Everything Else
PDF
Puppet Camp New York 2015: Puppet Enterprise Scaling Lessons Learned (Interme...
PPTX
Distributed Applications with Perl & Gearman
Faster PHP apps using Queues and Workers
Gearman and CodeIgniter
Gearman - Job Queue
Queue your work
Distributed Queue System using Gearman
Packaging is the Worst Way to Distribute Software, Except for Everything Else
Puppet Camp New York 2015: Puppet Enterprise Scaling Lessons Learned (Interme...
Distributed Applications with Perl & Gearman

Viewers also liked (9)

PPTX
Pozycki slideshare
PDF
PPTX
Who am i
PPS
C3 Network - Slides apresentacao
PPTX
Androids
PDF
Basic SEO
PDF
PPT
PDF
Pozycki slideshare
Who am i
C3 Network - Slides apresentacao
Androids
Basic SEO
Ad

Similar to flowr streamlining computing workflows (20)

PDF
High performance computing tutorial, with checklist and tips to optimize clus...
PDF
Overview of Scientific Workflows - Why Use Them?
PPT
BioMake BOSC 2004
PPTX
Your data isn't that big @ Big Things Meetup 2016-05-16
PPTX
C-SCALE Tutorial: Slurm
PPTX
Lrz kurs: big data analysis
PPTX
Advances in Scientific Workflow Environments
PPTX
It summit 150604 cb_wcl_ld_kmh_v6_to_publish
PPT
Parallel_and_Cluster_Computing.ppt
PPTX
TASK AND DATA PARALLELISM in Computer Science pptx
PDF
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
PPTX
Using R on High Performance Computers
PPTX
Scheduling in distributed systems - Andrii Vozniuk
PDF
Systems Bioinformatics Workshop Keynote
PPTX
2022.03.24 Snakemake.pptx
PDF
Scaling Systems for Research Computing
PDF
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
PPTX
MapReduce presentation
PDF
UNIX Basics and Cluster Computing
High performance computing tutorial, with checklist and tips to optimize clus...
Overview of Scientific Workflows - Why Use Them?
BioMake BOSC 2004
Your data isn't that big @ Big Things Meetup 2016-05-16
C-SCALE Tutorial: Slurm
Lrz kurs: big data analysis
Advances in Scientific Workflow Environments
It summit 150604 cb_wcl_ld_kmh_v6_to_publish
Parallel_and_Cluster_Computing.ppt
TASK AND DATA PARALLELISM in Computer Science pptx
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Using R on High Performance Computers
Scheduling in distributed systems - Andrii Vozniuk
Systems Bioinformatics Workshop Keynote
2022.03.24 Snakemake.pptx
Scaling Systems for Research Computing
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
MapReduce presentation
UNIX Basics and Cluster Computing
Ad

Recently uploaded (20)

PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PPTX
Microbiology with diagram medical studies .pptx
PPTX
2. Earth - The Living Planet earth and life
PPTX
G5Q1W8 PPT SCIENCE.pptx 2025-2026 GRADE 5
PDF
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
PDF
Sciences of Europe No 170 (2025)
PPT
POSITIONING IN OPERATION THEATRE ROOM.ppt
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PPT
protein biochemistry.ppt for university classes
PDF
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
PPTX
Comparative Structure of Integument in Vertebrates.pptx
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PDF
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
DOCX
Q1_LE_Mathematics 8_Lesson 5_Week 5.docx
PPTX
Cell Membrane: Structure, Composition & Functions
PPTX
Vitamins & Minerals: Complete Guide to Functions, Food Sources, Deficiency Si...
PPTX
neck nodes and dissection types and lymph nodes levels
PDF
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
PDF
lecture 2026 of Sjogren's syndrome l .pdf
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
Microbiology with diagram medical studies .pptx
2. Earth - The Living Planet earth and life
G5Q1W8 PPT SCIENCE.pptx 2025-2026 GRADE 5
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
7. General Toxicologyfor clinical phrmacy.pptx
Sciences of Europe No 170 (2025)
POSITIONING IN OPERATION THEATRE ROOM.ppt
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
protein biochemistry.ppt for university classes
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
Comparative Structure of Integument in Vertebrates.pptx
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
Q1_LE_Mathematics 8_Lesson 5_Week 5.docx
Cell Membrane: Structure, Composition & Functions
Vitamins & Minerals: Complete Guide to Functions, Food Sources, Deficiency Si...
neck nodes and dissection types and lymph nodes levels
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
lecture 2026 of Sjogren's syndrome l .pdf

flowr streamlining computing workflows

  • 3. ➡when one needs to wrangle a lot of data whybother?
  • 4. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved whybother?
  • 5. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel whybother?
  • 6. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother?
  • 7. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother? ✓ Effectively process a multi-step pipeline, spawning it across the computing cluster
  • 8. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother? ✓ Effectively process a multi-step pipeline, spawning it across the computing cluster ✓ Reproducible and transparent, with cleanly structured execution logs
  • 9. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother? ✓ Effectively process a multi-step pipeline, spawning it across the computing cluster ✓ Reproducible and transparent, with cleanly structured execution logs ✓ Track and re-run flows
  • 10. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother? ✓ Effectively process a multi-step pipeline, spawning it across the computing cluster ✓ Reproducible and transparent, with cleanly structured execution logs ✓ Track and re-run flows ✓ Lean and Portable, with easy installation
  • 11. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother? ✓ Effectively process a multi-step pipeline, spawning it across the computing cluster ✓ Reproducible and transparent, with cleanly structured execution logs ✓ Track and re-run flows ✓ Lean and Portable, with easy installation ✓ Run the same pipeline in the cloud (using star cluster) OR a local machine
  • 12. ➡when one needs to wrangle a lot of data ➡and there are multiple steps involved ➡esp. when some of the steps can be further broken down and processed in parallel ➡use a computing cluster, submit a web of jobs whybother? ✓ Effectively process a multi-step pipeline, spawning it across the computing cluster ✓ Reproducible and transparent, with cleanly structured execution logs ✓ Track and re-run flows ✓ Lean and Portable, with easy installation ✓ Run the same pipeline in the cloud (using star cluster) OR a local machine ✓ Supports multiple cluster computing platforms (torque, lsf, sge, slurm …)
  • 16. submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed all in parallel
  • 17. submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed all in parallel sequentially
  • 18. serial dependency types burstgather submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed all in parallel sequentially
  • 19. serial dependency types burstgather submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed decide the relationship b/w steps all in parallel sequentially
  • 20. serial dependency types burstgather submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed decide the relationship b/w steps many-to-many all in parallel sequentially
  • 21. serial dependency types burstgather submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed decide the relationship b/w steps many-to-many many-to-one all in parallel sequentially
  • 22. serial dependency types burstgather submission types scatter serial fivesimpleterms,definingallrelationships decide how pieces of a single step are processed decide the relationship b/w steps many-to-many many-to-one one-to-manyall in parallel sequentially
  • 26. fastq sam bam fastq sam bam fastq sam bam fastq sam bam fastq sam bam scatter serial many to many merged bam serial gather many to one alignment stats sort & index Mutations Copy Number variation Indel Calling downstream analysis scatter burst one to many submission type dependency type relationship Usingagenomicsexampleflow,withflowrconcepts
  • 30-31. a simple pipeline, where
★ we would sleep for a few seconds
★ create a few small files
★ merge those files
★ get the size of the resulting merged file
  • 32-37. simple pipeline in bash
echo 'Hello World !'   # say Hello to the world
sleep 5                # wait for a few seconds…
sleep 5
echo $RANDOM > tmp1    # create two small files
echo $RANDOM > tmp2
cat tmp1 tmp2 > tmp    # merge the two files
du -sh tmp             # check the size of the resulting file
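The commands above can be collected into one runnable script; a minimal sketch (sleeps shortened to 1 s and run in the background to mimic the parallel scatter step, and `echo $RANDOM` used so the random number is actually written, since `cat` would treat it as a filename):

```shell
#!/usr/bin/env bash
set -e

echo 'Hello World !'     # say Hello to the world

sleep 1 &                # wait for a few seconds; both sleeps run in
sleep 1 &                # parallel, as the scatter step would on a cluster
wait

echo $RANDOM > tmp1      # create two small files
echo $RANDOM > tmp2

cat tmp1 tmp2 > tmp      # merge the two files
du -sh tmp               # check the size of the resulting file
```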
  • 38-43. wrap bash commands into R
hello = 'echo Hello World !'                             # say Hello to the world
sleep = c('sleep 5', 'sleep 5')                          # wait for a few seconds…
tmp   = c('echo $RANDOM > tmp1', 'echo $RANDOM > tmp2')  # create two small files
merge = 'cat tmp1 tmp2 > tmp'                            # merge the two files
size  = 'du -sh tmp'                                     # check the size of the resulting file
  • 44-48. create a table of all commands
library(flowr)
# create a named list
lst = list(hello = hello, sleep = sleep, tmp = tmp, merge = merge, size = size)
# create a table
flowmat = to_flowmat(lst, "samp1")
a simple tab-delim table:
|samplename |jobname |cmd                 |
|:----------|:-------|:-------------------|
|samp1      |hello   |echo Hello World !  |
|samp1      |sleep   |sleep 5             |
|samp1      |sleep   |sleep 5             |
|samp1      |tmp     |echo $RANDOM > tmp1 |
|samp1      |tmp     |echo $RANDOM > tmp2 |
|samp1      |merge   |cat tmp1 tmp2 > tmp |
|samp1      |size    |du -sh tmp          |
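Since a flowmat is just a tab-delimited table, the same result can also be produced without R; a sketch in plain shell (the file name `flowmat.tsv` is my choice, not flowr's):

```shell
# write the flowmat shown above as a tab-delimited file, no R required
{
  printf 'samplename\tjobname\tcmd\n'
  printf 'samp1\thello\techo Hello World !\n'
  printf 'samp1\tsleep\tsleep 5\n'
  printf 'samp1\tsleep\tsleep 5\n'
  printf 'samp1\ttmp\techo $RANDOM > tmp1\n'
  printf 'samp1\ttmp\techo $RANDOM > tmp2\n'
  printf 'samp1\tmerge\tcat tmp1 tmp2 > tmp\n'
  printf 'samp1\tsize\tdu -sh tmp\n'
} > flowmat.tsv
```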
  • 52-54. create a flow definition
flowdef = to_flowdef(flowmat,
    sub_type = c("serial", "scatter", "scatter", "serial", "serial"),
    dep_type = c("none", "burst", "serial", "gather", "serial"),
    platform = "local")
a simple tab-delim table:
|jobname |sub_type |prev_jobs |dep_type | cpu|
|:-------|:--------|:---------|:--------|---:|
|hello   |serial   |none      |none     |   1|
|sleep   |scatter  |hello     |burst    |   1|
|tmp     |scatter  |sleep     |serial   |   1|
|merge   |serial   |tmp       |gather   |   1|
|size    |serial   |merge     |serial   |   1|
plot_flow(flowdef) draws the resulting graph: hello (dep: none, sub: serial) ➡ sleep (dep: burst, sub: scatter) ➡ tmp (dep: serial, sub: scatter) ➡ merge (dep: gather, sub: serial) ➡ size (dep: serial, sub: serial)
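The flow definition is likewise a plain tab-delimited table, so it can be sanity-checked outside R; a sketch (the file name `flowdef.tsv` is assumed) that verifies every `prev_jobs` entry names a job defined on an earlier row:

```shell
# write the flow definition shown above as a tsv
{
  printf 'jobname\tsub_type\tprev_jobs\tdep_type\tcpu\n'
  printf 'hello\tserial\tnone\tnone\t1\n'
  printf 'sleep\tscatter\thello\tburst\t1\n'
  printf 'tmp\tscatter\tsleep\tserial\t1\n'
  printf 'merge\tserial\ttmp\tgather\t1\n'
  printf 'size\tserial\tmerge\tserial\t1\n'
} > flowdef.tsv

# every prev_jobs value must be "none" or a jobname seen on an earlier row
awk -F'\t' 'NR > 1 {
  if ($3 != "none" && !($3 in seen)) { print "unknown prev_job: " $3; bad = 1 }
  seen[$1] = 1
}
END { exit bad }' flowdef.tsv && echo 'flowdef OK'
```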
  • 57-59. stitch & submit to the cluster (cloud or server): flowmat + flowdef
fobj = to_flow(flowmat, flowdef, execute = TRUE)
Working on: hello
|=====                             |  25%
Working on: sleep
|================                  |  50%
Working on: merge
|==================================| 100%
Working on: size
Flow is being processed. Track it from R/Terminal using:
flowr status x=~/flowr/runs/flowname-samp1-20151005-16-01-38-M8WniKJo
OR from R using:
status(x='~/flowr/runs/flowname-samp1-20151005-16-01-38-M8WniKJo')
  • 61-63. submit a flow, then…
➡ status(): monitor the status of a single flow OR multiple flows
➡ kill(): kill all the associated jobs of one or many flows
➡ rerun(): rerun the flow from an intermediate step
  • 66-73. Flow mat
- use any language to create a flow mat (a tsv file)
- the cmd column defines commands to run
|samplename |jobname |cmd                                |
|:----------|:-------|:----------------------------------|
|sample1    |A       |sleep 2 && sleep 5; echo hello     |
|sample1    |A       |sleep 13 && sleep 7; echo hello    |
|sample1    |B       |head -c 100000 /dev/urandom > tmp1 |
|sample1    |B       |head -c 100000 /dev/urandom > tmp2 |
|sample1    |C       |cat tmp1 tmp2 tmp3 > merged        |
|sample1    |D       |du -sh merged                      |
|sample1    |D       |ls merged                          |
Flow Definition: define relationships & resource requirements
- creatively define relationships using submission and dependency types
- each row describes resources for one step, providing full flexibility
|jobname |sub_type |prev_jobs |dep_type |queue  | memory|time  | cpu|platform |
|:-------|:--------|:---------|:--------|:------|------:|:-----|---:|:--------|
|A       |scatter  |none      |none     |medium | 163185|23:00 |   1|lsf      |
|B       |scatter  |A         |serial   |medium | 163185|23:00 |   1|lsf      |
|C       |serial   |B         |gather   |medium | 163185|23:00 |   1|lsf      |
|D       |scatter  |C         |burst    |medium | 163185|23:00 |   1|lsf      |
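As the slide notes, any language that writes tab-separated text can create a flowmat; a shell sketch of the table above, plus a quick check that every row has the three expected columns (`flowmat.tsv` is an assumed file name):

```shell
# write the A-D flowmat as a tsv; commands are stored as text, not run here
{
  printf 'samplename\tjobname\tcmd\n'
  printf 'sample1\tA\tsleep 2 && sleep 5; echo hello\n'
  printf 'sample1\tA\tsleep 13 && sleep 7; echo hello\n'
  printf 'sample1\tB\thead -c 100000 /dev/urandom > tmp1\n'
  printf 'sample1\tB\thead -c 100000 /dev/urandom > tmp2\n'
  printf 'sample1\tC\tcat tmp1 tmp2 tmp3 > merged\n'
  printf 'sample1\tD\tdu -sh merged\n'
  printf 'sample1\tD\tls merged\n'
} > flowmat.tsv

# a flowmat needs exactly these three columns on every row
awk -F'\t' 'NF != 3 { print "bad row " NR; exit 1 }' flowmat.tsv && echo 'flowmat OK'
```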