BUILDING BETTER 
BIOINFORMATICS 
SOFTWARE 
(WHY THE HECK NOT?) 
C. Titus Brown 
ctb@msu.edu 
Assistant Professor, MMG / CSE 
Michigan State University
BUILDING BETTER 
BIOINFORMATICS 
SOFTWARE 
(WHY THE HECK NOT?) 
C. Titus Brown 
ctb@msu.edu 
A???????? Professor, VetMed, UC Davis
Lansing, Michigan -> Davis, California
Dot plots FTW! 
Brown et al., 2005.
So I said these things… 
“this tipping point was exacerbated by the loss of about 
80% of the world's data scientists in the 2021 Great 
California Disruption.” 
“[ Benchmarks ] have proven to be stifling of innovation, 
because of the tendency to do incremental improvement.” 
ivory.idyll.org/blog/2014-bosc-keynote.html
2014 abic-talk
There is a real problem.
There is a massive profusion of software! 
Mick Watson, @BioMickWatson: 
biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/ 
jeffvictor.deviantart.com
The players, in caricature: 
1. Computer scientists 
2. Software engineers 
3. Data scientists 
4. Statisticians 
5. Biologists
The Computer Scientist 
Fast, sensitive, specific – pick one.
The (Good) Software Engineer 
Does it have any unit tests?
The Data Scientist 
How quickly can I run it, starting from 
scratch?
The Statistician 
What gives me the best p-value?
The Biologist 
What gives me the most publishable 
result?
Problems all along the way… 
1. Computer scientists: build delicate, hard-to-use, very high-performance software that solves the wrong problem. 
2. Software engineers: all work for Google. 
3. Data scientists: use the wrong programs -- because they’re actually usable. 
4. Statisticians: only get invited into the project six months after all the data is generated. 
5. Biologists: are desperate to find any one of the above who knows any biology at all.
Example: de novo mRNAseq 
Quality control 
Assembly 
Annotation 
Differential expression 
Every one of these steps is still an open research problem, with computational challenges and direct biological implications!
So: 
1. This is all still research. 
2. We’re unlikely to ever find out the right answer, but will 
merely settle for one that’s not obviously terrible. 
3. Everything is changing all the time: the data generation 
tech, the hardware, the software, the theory... 
4. Who are any of us to judge the value of any particular 
approach?
(Well, sometimes me, when I’m peer 
reviewer #2.)
All hands on deck! 
Quality control 
Assembly 
Annotation 
Differential expression 
We need it all! 
• Fast/sensitive/specific algorithms; 
• Solid software; 
• Statistical robustness; 
• Biological insight; 
• Well-trained data scientists. 
(The best bioinformaticians have multiple personality disorder, or so I tell myself.)
That sort of explains why. 
But this still leaves us with too many 
choices.
Example: de novo mRNAseq 
Quality control: 10-20 packages 
x Assembly: 2-5 packages 
x Annotation: 5-10 packages 
x Differential expression: 20-40 packages 
= 2000-40,000 combinations
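The range on this slide is just the product of the per-stage package counts. A minimal sketch of that arithmetic, assuming the four counts pair with the four stages in the order they are listed (the variable names are illustrative and not from the talk):

```python
from math import prod

# (low, high) package counts per stage, in slide order.
# Assumption: the counts pair with the stages in the order listed.
stages = {
    "quality control": (10, 20),
    "assembly": (2, 5),
    "annotation": (5, 10),
    "differential expression": (20, 40),
}

# Every choice at one stage can be combined with every choice at the others,
# so the number of distinct workflows is the product across stages.
low = prod(lo for lo, _ in stages.values())
high = prod(hi for _, hi in stages.values())
print(f"{low:,} to {high:,} possible workflows")
```

Adding a fifth stage with even a handful of options multiplies the total again, which is why evaluating tools one at a time says little about the workflow as a whole.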
What’s the solution!? 
Ultimately? All of… 
Whole-workflow evaluations of tools. 
Small tools (see “small tools manifesto”). 
Automation! 
Simulations, synthetic data, mock data, real data. 
Antagonistic data set development (**). 
Tool development driven with use cases. 
Build based on solid command-line workflows. 
Those things called “controls”. 
…and more
Trying out a few approaches…
1. Automate the hell out of everything 
(Ubuntu 14.04, git, make, IPython Notebook, LaTeX)
Time from publication of KAnalyze to our 100% 
reproducible re-evaluation? ~8 hours.
2. Protocols, not pipelines. 
STOP HIDING THE ANALYSIS STEPS. 
BIG BLACK BOXES ARE NOT SMALL 
TOOLS!
Write down what you’re doing… 
https://khmer-protocols.readthedocs.org/
…and add automated end-to-end tests. 
cf. “literate ReSTing”
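One way to picture end-to-end testing of a written protocol is to pull the shell commands out of the documentation and run them in order. The sketch below is not the tooling referred to on the slide; it is a simplified illustration that assumes commands live in reStructuredText literal blocks (an indented block after a line ending in `::`). The function name `extract_literal_blocks` and the sample `PROTOCOL` text are hypothetical.

```python
import textwrap

# Hypothetical protocol text; the commands are placeholders.
PROTOCOL = """\
Trim the reads::

   echo trimming reads

Then assemble::

   echo assembling
   echo done
"""


def extract_literal_blocks(rst_text):
    """Return the indented blocks that follow a line ending in '::'.

    Simplification: any line ending in '::' opens a block, and the block
    ends at the first non-blank, non-indented line.
    """
    blocks, current, in_block = [], [], False
    for line in rst_text.splitlines():
        if in_block:
            if line.strip() == "" or line.startswith((" ", "\t")):
                current.append(line)
                continue
            # A non-indented line closes the current block.
            blocks.append(current)
            current, in_block = [], False
        if line.rstrip().endswith("::"):
            in_block = True
    if in_block:
        blocks.append(current)
    # Remove the common indentation and surrounding blank lines.
    cleaned = [textwrap.dedent("\n".join(b)).strip() for b in blocks]
    return [c for c in cleaned if c]


for number, block in enumerate(extract_literal_blocks(PROTOCOL), start=1):
    print(f"--- step {number} ---")
    print(block)
```

Feeding each extracted block to a shell that stops on the first error would turn the written protocol into its own test: if the documentation drifts from what actually runs, the test fails.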
3. Drive sustainable software 
development with use cases.
…that are explicit…
…versioned…
…and automated.
4. Put everything in the cloud and 
measure it. 
~40 hours; 
m1.xlarge 
Eel Pond mRNAseq protocol.
5. Compare programs and workflows fairly. 
Genome Reference 
          Quality Filtered  Diginorm  Partition  Reinflation 
Velvet    -                 80.90     83.64      84.57 
IDBA      90.96             91.38     90.52      88.80 
SPAdes    90.42             90.35     89.57      90.02 
Mis-assembled Contig Length 
          Quality Filtered  Diginorm  Partition  Reinflation 
Velvet    -                 52071358  44730449   45381867 
IDBA      21777032          20807513  17159671   18684159 
SPAdes    28238787          21506019  14247392   18851571 
Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013 
Also! Tip o’ the hat to Michael Barton, nucleotid.es
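Once every assembler has been run on the same data through the same treatments, the comparison itself is mechanical. A minimal sketch using the mis-assembled contig lengths from this slide, assuming that a smaller value is better; the function name `best_per_treatment` is illustrative and not part of any published pipeline:

```python
# Mis-assembled contig length per assembler and treatment, copied from the slide.
# None marks the cell shown as "-" on the slide.
TREATMENTS = ["Quality Filtered", "Diginorm", "Partition", "Reinflation"]
MISASSEMBLED = {
    "Velvet": [None, 52071358, 44730449, 45381867],
    "IDBA": [21777032, 20807513, 17159671, 18684159],
    "SPAdes": [28238787, 21506019, 14247392, 18851571],
}


def best_per_treatment(table, treatments):
    """Return {treatment: assembler with the smallest reported value}."""
    best = {}
    for i, treatment in enumerate(treatments):
        # Skip assemblers with no result for this treatment.
        scored = {name: row[i] for name, row in table.items() if row[i] is not None}
        best[treatment] = min(scored, key=scored.get)
    return best


for treatment, assembler in best_per_treatment(MISASSEMBLED, TREATMENTS).items():
    print(f"{treatment}: {assembler}")
```

On this one metric no single assembler wins under every treatment, which is part of the argument for evaluating whole workflows rather than isolated tools.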
A super fun way to do reviews! 
• “What a nice new transcriptome assembler! Interesting 
how it doesn’t perform that well on my 10 test data sets.” 
• “Hey, so you make these claims, but I ran your code, 
and…” 
• “Fun fact! Your source code has a syntax error in it – even 
Perl has standards! You’re still sure that’s the script you 
used?” 
• “Here – use our evaluation pipeline, since you clearly 
need something better.” 
The Brown Lab: taking passive aggression to a whole new level!
We breed our own problems. 
Reward the behavior you want to see. 
Let’s level up the field, already.
What are we working on, scientifically 
speaking?
Streaming error correction of genomic, transcriptomic, 
metagenomic data via graph alignment 
Jason Pell, Jordan Fish, Michael Crusoe
Error correction on simulated E. coli data 
          TP               FP          TN           FN 
1.2-pass  3,494,631 99.8%  3,865       460,601,171  5,533 2.8% 
          (corrected)      (mistakes)  (OK)         (missed) 
1% error rate, 100x coverage. 
Michael Crusoe, Jordan Fish, Jason Pell
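The four counts on this slide form a confusion matrix, so the usual summary rates can be derived from them. A short sketch, assuming TP means an error that was corrected, FP a correct position changed by mistake, TN a correct position left alone, and FN an error that was missed (this reading follows the labels on the slide but is an interpretation):

```python
# Counts copied from the slide (1.2-pass, simulated E. coli data).
tp = 3_494_631    # "corrected"
fp = 3_865        # "mistakes"
tn = 460_601_171  # "OK"
fn = 5_533        # "missed"

sensitivity = tp / (tp + fn)  # share of real errors that were corrected
precision = tp / (tp + fp)    # share of corrections that were right
specificity = tn / (tn + fp)  # share of correct positions left untouched

print(f"sensitivity: {sensitivity:.1%}")
print(f"precision:   {precision:.1%}")
print(f"specificity: {specificity:.4%}")
```

The sensitivity works out to roughly 99.8%, which appears to be the figure printed beside the TP count.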
Error correction -> variant calling 
Single pass, reference free, tunable, streaming 
online variant calling. 
(Hey, look, ma – a new mapper!)
Infrastructure: distributed graph database server 
[Diagram components:] 
Web interface + API 
Compute server (Galaxy? Arvados?) 
Data/Info 
Raw data sets 
Public servers; "walled garden" server; private server 
Graph query layer 
Upload/submit (NCBI, KBase) 
Import (MG-RAST, SRA, EBI) 
ivory.idyll.org/blog/2014-moore-ddd-talk.html
AGTA talk on Monday 
• 3:15-4pm – come see me try to convince biomedical 
researchers to share their data! 
• 4-4:30pm – come listen to Ana Conesa talk about multi-omics 
data integration! 
Thanks!


Editor's Notes

  • #43: Update from Jordan
  • #45: Analyze data in cloud; import and export important; connect to other databases.