Agile analysis development

Agile Analysis
Pipeline
Andy Brown
New Pipeline Development

Who Are We?
One of the world's largest DNA
Sequencing Centres
Second largest compute centre in
Europe, after CERN

What Do We Do?
Human, Mouse, Zebrafish and
Pathogen Genome Projects
Post sequencing analysis, annotation
and maintenance
(It's never truly finished!)

Who Am I?
Tracking systems and analysis pipeline
for Next Generation Sequencing
Technologies
Perl, Web Technologies, Moose

Next Generation Sequencing?
Massively Parallel DNA Sequencing
Producing Millions of Reads per run
~38 instruments
~5Tb of data a day
Managing quick turnaround on Staging
of 320Tb data a month

Analysis
Convert Images to Bases
Obtain quality values
Recalibrate quality
Separate DNA sequences from
different projects
Do this in parallel
Be able to extend this
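
A minimal sketch of the "separate DNA sequences from different projects" step, assuming each read arrives as a (tag, sequence) pair. The tag table, project names and function name are hypothetical, not taken from the real pipeline:

```python
from collections import defaultdict

# Hypothetical tag-to-project table; in practice this would come from
# the tracking system.
TAG_TO_PROJECT = {"ACGT": "project_a", "TTAG": "project_b"}


def split_by_tag(reads, tag_to_project):
    """Group (tag, sequence) pairs by project; unknown tags go to 'rejects'."""
    by_project = defaultdict(list)
    for tag, sequence in reads:
        by_project[tag_to_project.get(tag, "rejects")].append(sequence)
    return dict(by_project)


reads = [
    ("ACGT", "GATTACA"),
    ("TTAG", "CCGGA"),
    ("NNNN", "TTTTT"),  # unrecognised tag
    ("ACGT", "AACCG"),
]
split = split_by_tag(reads, TAG_TO_PROJECT)
for project, sequences in split.items():
    print(project, sequences)
```

Reads with an unrecognised tag are kept in a separate "rejects" group rather than dropped, so nothing is lost if the tag table turns out to be incomplete.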

Analysis
Current analysis running script was
unable to cope with changing demands

Initial Product Creation
[Flow diagram. Steps shown: Run Completes; Bustard; Adaptor Removal; Split by Tag; Create Cal Table; Calibrate Scores; Consent Align; K-mer Error Correction; Create Fastq; Align to Ref; Create SRF.
Data shown: CIF; Qseq, Sig2; Cal Table; Cal-Qseq; Fastq; BAM; Control Refs; Sample Refs; Index (+ tags, + consent, + rejects).
Note: Gray boxes may be pass-through. Continues on the next slide.]

QC and Archival
[Diagram. Inputs shown: Control Refs, SRF, Sig2, Index, fastq, BAM]
Run Summary (Summary.htm stuff)
IVC Plots
Q20 Counts
Fastqcheck
Insert Size Histogram
Error rates and QQ-Plots
Heatmaps
SNP Finder
... And Anything Else You Can Think Of
Human QC
Fuse
Archive

Working in an Agile Manner
Current manner – still close to Cascade,
some idea of iterations
I wanted more agility – defined iterations
Got close

First Iteration - It1
Chop down the brief into stories
Spoke with creator of the brief, my boss
& team about what was needed
Pluggable, Automatic, Auto QC

It1: First bit of Coding
Read old code – anything I can steal –
yes!
Write some 'in principle' tests to get an
idea of the way to go.
Write some code for those tests.

It1: Prototype
Launch next
Launch Self or Finish
LSF DEPENDENCIES
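
A rough sketch of chaining jobs through LSF dependencies, assuming each step is submitted with `bsub -J <name>` and waits on the previous step with `-w "done(<name>)"`. The step names and script names are hypothetical, and the commands are only built as strings here, not submitted:

```python
import shlex


def bsub_chain(steps):
    """Return one bsub command line per step, each waiting on the previous one."""
    commands = []
    previous = None
    for name, command in steps:
        args = ["bsub", "-J", name]
        if previous is not None:
            # LSF dependency expression: start only once the previous job is done.
            args += ["-w", f"done({previous})"]
        args.append(command)
        commands.append(" ".join(shlex.quote(arg) for arg in args))
        previous = name
    return commands


# Hypothetical step and script names, for illustration only.
steps = [("bustard", "run_bustard.sh"), ("calibrate", "run_calibration.sh")]
commands = bsub_chain(steps)
for line in commands:
    print(line)
```

Every extra step adds another wrapper and another dependency, so a chain like this grows more fragile with each part added.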

It1: Fail
Test Principle – Worked
Reality – Too Unwieldy

It1: Evaluation
Too much wrapping
Too much could go wrong with lots of parts
Out the Window!

Second Iteration - It2
So, I'm Agile. I don't see this as a
setback.
Opportunity to try a different approach.
I sketch it out.

Flag Waver
[Diagram: the Flag Waver calls Function a to Function e in turn. Each function creates an object to launch its component (Ca to Ce), which launches Component a to Component e.]

It2: Second lot of Coding
Again, start off with in principle tests
Write some code to pass those tests
Select a bit of real world to apply it to

It2: Pass
This real world bit works
All jobs are launched as expected
Replace the old section with this bit
It still works :) A perfect replacement

It2: Evaluation
Success :)
The Flag Waver model - functions that
know what to do, but no knowledge of
other functions
This should make it pluggable
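
One way to picture the Flag Waver model, as a sketch only: a coordinator holds an ordered list of function names and calls each in turn, and each function returns the jobs it launched without knowing about any other function. The class, function and job names are assumptions for illustration:

```python
class FlagWaver:
    """Calls independent pipeline functions in a configured order."""

    def __init__(self, function_order):
        self.function_order = list(function_order)
        self.registry = {}

    def register(self, name, func):
        """Plug in a function under the name used in function_order."""
        self.registry[name] = func

    def run(self, context):
        """Call each function in order and collect the jobs it launched."""
        launched = []
        for name in self.function_order:
            launched.extend(self.registry[name](context))
        return launched


# Hypothetical functions: each knows what to launch, nothing about the others.
def analysis(context):
    return [f"analysis:{context['run_id']}"]


def qc_insert_size(context):
    return [f"qc_insert_size:{context['run_id']}"]


waver = FlagWaver(["analysis", "qc_insert_size"])
waver.register("analysis", analysis)
waver.register("qc_insert_size", qc_insert_size)
jobs = waver.run({"run_id": 1234})
print(jobs)
```

Adding, removing or reordering a step only touches the order list and the registry, which is what makes the design pluggable.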

It2: Evaluation
Bulky data getting generated multiple
times over – Needs more DRYness

It3: Some new requests
It would be easier to code if we didn't
have users of the applications!
The first new request comes in for some
automated QC
Just launch them at the correct time

It3: Scrum
So, I scrum.
The objective: Work out priorities for
this iteration.
There are many 'stories', I decide on the
following.

It3: Scrum
Write something to make data
construction and passing more DRY
Write another replacement pipeline
section
Try to incorporate 1 QC into previous
pipeline section

It3: Tests
I write some tests to assess launching
the analysis pipeline
I write some tests to incorporate a QC
launch into the post analysis pipeline
I run the tests, which fail

It3: Code
I decide first to add the QC launch
My boss wants to start getting the data
I get a quick view of how pluggable the
system actually is
It is good :)

It3: Code
The analysis guys want their pipeline to
start showing up
Good reason - a new version of the
scripts has appeared, and they don't
want to patch the old
This takes the rest of the iteration

It3: Release
The most important release so far
Completely replace old code with new
Took about 2 days, with bug fixing

It3: Evaluation
Bugs on Release - tests don't always
prove everything!
No time to DRY out the code
Successful product into production
Old code has gone to 'silicon heaven'

It4: Scrum
I again scrum
So far, iterations have been quite quick
To give the pipeline some time running
in production, I decide to refactor this
iteration

It4: Scrum
Utilising more Inheritance (using Moose
Roles)
Create external role to translate
attributes without building hashes each
time
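
The real role is a Moose role in Perl. As a loose analogue only, here is a Python mixin that passes an object's set attributes straight to another class, so no hash has to be built by hand at each call site. All class and method names are hypothetical:

```python
class AttributeTranslator:
    """Mixin: hand this object's set, public attributes to another class."""

    def attributes_as_kwargs(self):
        """Collect public attributes that have a value."""
        return {
            key: value
            for key, value in vars(self).items()
            if not key.startswith("_") and value is not None
        }

    def new_with_my_attributes(self, cls, **overrides):
        """Build a cls instance from my attributes, plus any overrides."""
        return cls(**{**self.attributes_as_kwargs(), **overrides})


class Launcher(AttributeTranslator):
    def __init__(self, run_id, lane=None, verbose=None):
        self.run_id = run_id
        self.lane = lane
        self.verbose = verbose


class QcComponent:
    def __init__(self, run_id, lane=None, verbose=None):
        self.run_id = run_id
        self.lane = lane
        self.verbose = verbose


launcher = Launcher(run_id=1234, lane=3)
qc = launcher.new_with_my_attributes(QcComponent, verbose=True)
print(qc.run_id, qc.lane, qc.verbose)
```

The sketch assumes the target class accepts the same attribute names as the source object. The collection logic lives in one place instead of being repeated in every launcher.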

It4: In Brief
After 2 weeks
» a nicely refactored pipeline
» external role to DRY out data (released to
CPAN)
» time to have monitored how the pipeline
was running
Release and go

The next few iterations
Iterations continue, releasing every 2-3
weeks :)
Until it all broke :(

The Broken Pipeline Iteration
Up until now, the pipeline had been
behaving itself.
New analysis code came from our
supplier, our R&D team would test, then
I would throw the switch and release.

However, they changed something we
didn't find in testing.
Runs with multiplexed lanes broke, as
they have an extra 'barcode' read

Luckily, here is where being agile really
helped.
Whilst I had just 'scrummed' to decide
my priorities, I just dropped them
New Priority – Fix the Pipeline

Pluggable, so could a function or two be
moved to help?
Yes! 1 function move would halve the
problem.
Run on example – expected outcome

Now to fix the 3 read / 2 read problem
Again, write tests, test, code, test, run on
example, write tests for bugs, test, code,
test, run on example ....
End of this iteration, able to release a
fully fixed pipeline
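
A hedged sketch of the kind of check the 3 read / 2 read fix needs: decide what each read in a run is from the read count and whether the lanes are multiplexed. It assumes the barcode (index) read comes second, and the function and role names are hypothetical:

```python
def label_reads(read_count, multiplexed):
    """Return the role of each read in a run, in order.

    Assumption: on multiplexed runs the barcode (index) read is read 2.
    """
    if multiplexed:
        layouts = {2: ["forward", "index"], 3: ["forward", "index", "reverse"]}
    else:
        layouts = {1: ["forward"], 2: ["forward", "reverse"]}
    if read_count not in layouts:
        raise ValueError(
            f"unexpected read count {read_count} (multiplexed={multiplexed})"
        )
    return layouts[read_count]


print(label_reads(2, multiplexed=False))
print(label_reads(3, multiplexed=True))
```

Failing loudly on an unexpected layout means a future change in read structure shows up in testing rather than in production.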

Evaluation:
Being Agile, both in project management
and design, helped here.
How?

Design:
Plugin design of the pipeline - half the
problem was solved just by moving
something.
The other part just by writing a new
module.
It just worked!

Project Management:
Changing an iteration's priorities so that
the urgently required fix could be
done...
...barely disrupting the flow of work on
feature requests

What has happened since?
Development has settled into a 2-3
week release cycle
Team knows development position
Made it easier for them to cover me

Acknowledgements
David Jackson
Guoying Qi
John O'Brien
Marina Gourtovaia
Sri Deevi
Tom Skelly
Irina Abnizova
Steve Leonard
Tony Cox
You

Contact Me!
http://software-east.net/profile/AndyBrown
setitesuk@gmail.com
http://vampiresoftware.blogspot.com
http://twitter.com/setitesuk
http://www.slideshare.net/setitesuk
http://github.com/setitesuk
