A meta-analysis of computational biology benchmarks
reveals predictors of programming accuracy
Paul Gardner
University of Canterbury
Christchurch
New Zealand
Hard work from...
ResBaz
I want to say a big thank you to the organisers of ResBaz and NeSI and
Aleksandra and...!
Everything you are about to see is built using tools you have learned at
ResBaz...
Warning: the following research is a work in progress, conclusions may
change (after I’ve triple-checked data & claims)
Pretend we want to build a phylogenetic tree...
Building trees...
Bioinformaticians are bad, impatient & intolerant people!
Once you have gathered your data, you are faced with a problem...
Parsimony (useful if we want to publish in Cladistics)
47 methods
ARB FootPrinter LVB Parsimov POY
Bionumerics Freqpars MALIGN PAST PRAP
BIRCH Gambit MEGA PAUP* PSODA
Bosque GAPars Mesquite PAUPRat RA
BPAnalysis GelCompar-II Murka PaupUp SeaView
CAFCA GeneTree Network phangorn SeqState
CRANN gmaes NimbleTree PHYLIP Simplot
DAMBE Hennig86 NONA PhyloNet sog
EMBOSS IDEA Notung Phylo_win TCS
TNT
Felsenstein http://evolution.genetics.washington.edu/phylip/software.html
Building trees...
Maximum likelihood
97 methods
ALIFRITZ EMBOSS MOLPHY PHYLLAB rRNA-phylogeny
aLRT EREM MrAIC PhyloCoCo SeaView
ARB fastDNAml MrModeltest Phylo_win Segminator
Bio++ fastDNAmlRev MrMTgui PHYML SEMPHY
Bionumerics FASTML MultiPhyl PhyML-Multi SeqPup
BIRCH FastTree NEPAL PhyNav SeqState
BootPHYML GARLI NHML PHYSIG SIMMAP
Bosque GZ-Gamma nhPhyML PLATO Simplot
CodeAxe HY-PHY NimbleTree Porn* SLR
CoMET IQPNNI p4 PRAP Spectronet
Concaterpillar Kakusan4 PAL PROCOV Spectrum
CONSEL Leaphy PAML ProtTest SplitsTree
Crux Mac5 PARAT PTP SSA
DAMBE McRate PARBOOT r8s-bootstrap TipDate
DART Mesquite PASSML Rate4Site Treefinder
Darwin MetaPIGA PAUP* rate-evolution TREE-PUZZLE
dnarates MixtureTree PAUPRat RAxML Vanilla
DPRML Modelfit PaupUp raxmlGUI
DT-ModSel ModelGenerator phangorn RevDNArates
Building trees...
Bayesian methods
28 methods
AMBIORE BEST IMa2 p4 SIMMAP
ANC-GENE Bio++ Mesquite PAL tracer
BAli-Phy bms_runner MrBayes PAML Vanilla
BAMBE burntrees MrBayesPlugin PHASE
BayesPhylogenies Cadence MrBayes-tree-scanners PHYLLAB
BEAST Crux Multidivtime PhyloBayes
Felsenstein http://evolution.genetics.washington.edu/phylip/software.html
How can we choose software?
Which of the 172 methods do you use?
Can we trust the authors of software?
We can read all the manuscripts & manuals describing 172 software
packages. But...
How should we choose software?
Some possibilities (assuming you don’t create another method...)
Do you know the developer? Are they famous?
Select the most recently published tool?
Has the software been widely adopted?
Is it published in a good journal?
Is the software fast?
We could test the software...
Neutral comparison studies (a.k.a. benchmarks)
A. The main focus of the article is the comparison itself.
B. The authors should be reasonably neutral.
C. The evaluation criteria, methods, and data sets should be chosen in a
rational way.
Try approaching software like a scientist
Are any good controls available?
Positive: databases, publications,
simulation, ...
Negative: randomized, select
relevant negative data, ...
Some common accuracy metrics:
Sensitivity (true positive rate)
Specificity (true negative rate)
Matthews correlation coefficient
Area under an ROC curve
[Figure: ROC curves (false positive rate vs. true positive rate, both axes from 0.0 to 1.0) for Pfam, Treefam, Custom, PROVEAN, Polyphen-2, FATHMM, and FATHMM (unweighted).]
Wheeler et al. (2016) A Profile-Based Method for Measuring the Impact of Genetic Variation. bioRxiv.
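The accuracy metrics listed above can be computed from the counts of a positive/negative control experiment. The following is a minimal sketch, not code from the talk: the function names and the example counts are made up for illustration, and the AUC is computed with the rank-based (pairwise comparison) definition rather than by integrating a plotted curve.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    # Sensitivity: fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical counts and scores, for illustration only.
    print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
    print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise AUC is quadratic in the number of controls, which is fine for small benchmarks; a sort-based version would be preferable for large ones.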
Benchmarks are useful, and fun...
Tools can be slow and inaccurate!
[Figure: A) Sum of log odds scores, phylum level. Deviation and log2 of runtime (minutes) for CLARK, Kraken, OneCodex, LMAT, MG-RAST, MetaPhlAn, mOTU, Genometa, QIIME, EBI, MetaPhyler, MEGAN, taxator-tk, and GOTTCHA. Runtime annotations: ~30 mins, ~17 hrs, ~23 days (log2 values of 5, 10 and 15).]
Is there really a relationship between speed & accuracy?
Can we run a meta-analysis of bioinformatic benchmarks?
What factors are predictive of accuracy?
Training articles:
initially 10 (historical knowledge)
Candidate articles:
((bioinformatics) AND (algorithmic OR algorithms OR biotechnologies OR
computational OR kernel OR methods OR procedure OR programs OR software
OR technologies)) AND (accuracy OR analysis OR assessment OR benchmark
OR benchmarking OR biases OR comparing OR comparison OR comparisons OR
comprehensive OR effectiveness OR estimation OR evaluation OR metrics
OR efficiency OR performance OR perspective OR quality OR rated OR
robust OR strengths OR suitable OR suitability OR superior OR survey OR
weaknesses) AND (benchmark OR competing OR complexity OR cputime OR
duration OR fast OR faster OR perform OR performance OR slow OR speed
OR time)
568,130 articles
Background articles:
(bioinformatics [TIAB] 2013:2015 [dp]) #sorted on first author
154,485 articles
Hunting for relevant articles
After trying Abstrackr (& getting annoyed)...
Workflow:
1. Start from the training articles and the background articles
2. Remove high-frequency words
3. Compute word & di-word frequencies
4. Compute word scores:
   $\mathrm{lo}(word) = \log_2\left(\frac{f_{training}(word) + \delta}{f_{background}(word) + \delta}\right)$
5. Score & rank candidate articles: $\sum_i \mathrm{lo}(w_i)$
6. Manually evaluate high-scoring articles (yes/no)
7. Build model

Example word scores:
logOdds  tnFreq  bgFreq  word
  5.28   0.0019  0.0000  benchmarking
  5.21   0.0061  0.0002  benchmark
  4.91   0.0011  0.0000  noisy
  4.85   0.0022  0.0001  metrics
  4.85   0.0003  0.0000  encouragingly
  ...
 -7.90   0.0000  0.0024  disease
 -8.02   0.0000  0.0026  associated
 -8.09   0.0000  0.0027  mirnas
Word and article scores
Can use the same scoring scheme for words that we use for scoring
biological sequences...
$$\mathrm{logOdds}(word) = \log_2\left(\frac{f_{training}(word) + \delta}{f_{background}(word) + \delta}\right)$$
$$\mathrm{articleScore} = \sum_{word \in article} \mathrm{logOdds}(word)$$
[Figure: head & tail word scores. Axis: word score (bits), from −10 to 5.]
Lowest-scoring words shown: expression, mirnas, associated, patients, binding, mirna, expressed, network, involved, regulated, levels, revealed, database, mutations, drug, response, tumor, system, activity, induced
Highest-scoring words shown: benchmarking, sequencers, benchtop, merits, correctness, benchmark, kernels, convolution, winner, supertree, structal, seeker, choosing, corpora, supermatrix, phenocopy, epistasis, segmod, encad, balibase
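The word and article scores can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the talk: the value of the pseudocount δ is not given on the slides (1e-6 here is arbitrary), the tokeniser is a simple lowercase letter-run split, each distinct word is counted once per article, and the high-frequency word removal and di-word frequencies mentioned in the workflow are left out.

```python
import math
import re
from collections import Counter

DELTA = 1e-6  # pseudocount; the slides do not state delta, so this is an assumption


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(articles):
    """Relative frequency of each word across a collection of article texts."""
    counts = Counter(w for article in articles for w in tokenize(article))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


def log_odds_scores(training, background, delta=DELTA):
    """logOdds(word) = log2((f_training + delta) / (f_background + delta))."""
    f_tr = word_frequencies(training)
    f_bg = word_frequencies(background)
    vocabulary = set(f_tr) | set(f_bg)
    return {
        w: math.log2((f_tr.get(w, 0.0) + delta) / (f_bg.get(w, 0.0) + delta))
        for w in vocabulary
    }


def article_score(text, scores):
    """Sum logOdds over the distinct words of an article; unseen words score 0."""
    return sum(scores.get(w, 0.0) for w in set(tokenize(text)))


if __name__ == "__main__":
    # Toy texts, invented for illustration.
    training = ["benchmark of alignment tools", "benchmark accuracy"]
    background = ["disease associated genes", "disease expression"]
    scores = log_odds_scores(training, background)
    print(article_score("a benchmark of tools", scores))
    print(article_score("disease genes", scores))
```

Summing over distinct words keeps long abstracts from dominating the ranking; summing over every token occurrence would be an equally plausible reading of the formula.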
Iteratively checking articles...
1. Score and rank candidate articles
2. Check the highest scoring articles, add to either training or background
articles
3. Return to 1.
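The three steps above amount to a simple screening loop. The sketch below is one way it might look, assuming a reviewer callback `is_relevant` (hypothetical; in the talk this was a manual check), a fixed batch size, and whitespace tokenisation. The pseudocount and all example texts are invented for illustration.

```python
import math
from collections import Counter


def word_scores(training, background, delta=1e-6):
    """Log-odds (bits) of each word in training versus background texts."""
    def freqs(docs):
        counts = Counter(w for d in docs for w in d.lower().split())
        total = sum(counts.values()) or 1
        return {w: c / total for w, c in counts.items()}

    f_tr, f_bg = freqs(training), freqs(background)
    return {
        w: math.log2((f_tr.get(w, 0.0) + delta) / (f_bg.get(w, 0.0) + delta))
        for w in set(f_tr) | set(f_bg)
    }


def screen(candidates, training, background, is_relevant, batch_size=2, max_rounds=10):
    """Iteratively rank candidates, review the top batch, and update both sets."""
    training, background = list(training), list(background)
    remaining = list(candidates)
    for _ in range(max_rounds):
        if not remaining:
            break
        scores = word_scores(training, background)
        # Step 1: score and rank the candidate articles.
        remaining.sort(
            key=lambda d: sum(scores.get(w, 0.0) for w in set(d.lower().split())),
            reverse=True,
        )
        # Step 2: review the highest-scoring articles and file each one.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for doc in batch:
            (training if is_relevant(doc) else background).append(doc)
        # Step 3: loop back to step 1 with the updated word scores.
    return training, background, remaining


if __name__ == "__main__":
    candidates = [
        "benchmark of aligners speed",
        "mirna disease expression",
        "accuracy benchmark comparison",
        "tumor drug response",
    ]
    result = screen(
        candidates,
        training=["benchmark accuracy of tools"],
        background=["disease associated mirna expression"],
        is_relevant=lambda doc: "benchmark" in doc,  # stand-in for manual review
    )
    print(result)
```

In practice one would stop once several consecutive batches contain no relevant articles, rather than reviewing every candidate.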
So far we have...
found 35 matching articles. Manually extracted ranks, IF, H, ...
84 benchmarks (method accuracies and speeds)
203 bioinformatic methods
63 journals (47 Bioinformatics, 17 BMC Bioinformatics, ...)
124 author Google Scholar profiles
abyss bwasw dialigntx gossamer mafftfftns2 mpest paralign repeatfinder seqmap ssake velvet
antepiseeker caml diffsplice gottcha mafftlinsi mpjclustalw pass repeatgluer sga ssap wmrpmp
apg camp diginormvelvet greedyft maq mpsclustalw perm repeatscout sharcgs ssearch woodhams
barry ce dima gsnap mats mrfast phylonetft rmap shrimp ssm wublast
bfast celera djigsaw heidge megan mrpml piler rnacofold simulatedannealing sst xalign
bismark clark downhillsimplex hmmer metaphlan mrpmp poa rnaduplex sl st xcmswithcorrection
biss clc dsgseq idbaud metaphyler mrsfast poy rnahybrid smalt starbeast xcmswithoutretentiontime
boost clustalomega ebi igtpduplossft mgrast msinspect poystar rnaplex snap strcutal zema
bowtie clustalw edenanonstrict inchworm minia multalin pragcz rnaup snpruler swissmodel
bowtie2 comus edenastrict infernal mira muscle probalign rsearch soap taipan
bratbw coprarna edit intarna mlclustalw musclemaxiters probcons rsmatch soap2 targetrna
bsmap cosine epimode kalign mlclustalwquicktree mzmine probtree sam soapdenovo targetrna2
bsseeker cro erpin kbsps mlmafft ncbiblast pso sate spades taxatortk
buckycon cufflinks fa kraken mlmafftparttree nest pt scro sparse tcoffee
buckymrbayes dali fasta kthse mlmuscle newbler qiime scwrl sparseassembler team
buckymrbayesspa de fasttree leidnl mlopal novoalign qsra scwrlcons spcomp tmap
buckypop dexseq gassst lmat mlprankgt oases ravenna segemehl specarray transabyss
buckyraxml dialign genometa lsqman modellerv onecodex raxml segmodencad spt trinity
builder dialign22 gojobori mafft mosaik openms raxmllimited seqgsea srmapper upmes
bwa dialignt goldman mafftfftns motu pairfold rdiffparam seqman ssaha vcake
Possible predictors of accuracy...
Number of citations (#citations)
Journal impact factor (journal.IF)
Journal H5 index, Google Scholar (journal.H5)
Corresponding author's H-index (author.H)
Corresponding author's M-index (author.M)
Relative age
[Figure: frequency histogram for each of the six predictors above.]
I have found no *significant* predictors of accuracy!
Z = −1.52; p = 0.94
[Figure: Correlations with accuracy rank. Spearman's rho (axis from −0.10 to 0.10) for author.M, author.H, journal.H5, relative age, speed, #citations, and journal.IF.]
[Figure: Accuracy vs. Speed. Axes: mean normalised speed rank and mean normalised accuracy rank. Markers: * = high-profile journal; o = high-profile author; x = highly cited. Quadrants: fast+accurate, fast+inaccurate, slow+inaccurate, slow+accurate.]
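The "mean normalised rank" on the axes can be illustrated as follows. The exact normalisation used in the talk is not stated, so this sketch assumes one plausible scheme: within each benchmark, methods are ranked and the ranks are rescaled to [0, 1] with 1 as best (ties share the average rank), and each method's ranks are then averaged over the benchmarks it appears in. Tool names and numbers are hypothetical.

```python
from collections import defaultdict


def normalised_ranks(values, higher_is_better=True):
    """Map each method to a rank in [0, 1] (1 = best) within one benchmark."""
    # Order methods from worst to best according to the metric.
    methods = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(methods)
    if n < 2:
        return {m: 1.0 for m in methods}
    ranks = {}
    i = 0
    while i < n:
        # Find the run of methods tied with methods[i].
        j = i
        while j + 1 < n and values[methods[j + 1]] == values[methods[i]]:
            j += 1
        average_position = (i + j) / 2  # tied methods share the average position
        for m in methods[i:j + 1]:
            ranks[m] = average_position / (n - 1)
        i = j + 1
    return ranks


def mean_normalised_ranks(benchmarks, higher_is_better=True):
    """Average each method's normalised rank over all benchmarks it appears in."""
    per_method = defaultdict(list)
    for results in benchmarks:
        for method, rank in normalised_ranks(results, higher_is_better).items():
            per_method[method].append(rank)
    return {m: sum(r) / len(r) for m, r in per_method.items()}


if __name__ == "__main__":
    # Hypothetical accuracies from two benchmarks and runtimes from one.
    accuracy = [{"toolA": 0.9, "toolB": 0.7, "toolC": 0.5}, {"toolA": 0.6, "toolB": 0.8}]
    runtime_minutes = [{"toolA": 30, "toolB": 1000}]
    print(mean_normalised_ranks(accuracy))
    print(mean_normalised_ranks(runtime_minutes, higher_is_better=False))
```

Working with ranks rather than raw metrics lets benchmarks that report different accuracy measures be pooled in one analysis.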
IF & #citations
IF: Spearman’s ρ = 0.104; p-value = 0.20
#cites: Spearman’s ρ = 0.101; p-value = 0.18
[Figure: Accuracy vs. IF. Mean normalised accuracy rank against journal impact factor (axis ticks from 0.5 to 50).]
[Figure: Accuracy vs. #citations. Mean normalised accuracy rank against number of citations (axis ticks from 1 to 100,000).]
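Spearman's ρ, as reported above, is the Pearson correlation of the ranks of the two variables. The following standard-library sketch shows the calculation with average ranks for ties. It does not compute a p-value (in practice `scipy.stats.spearmanr` returns both the coefficient and a p-value), and the example numbers are invented.

```python
import math


def average_ranks(xs):
    """1-based ranks of xs, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Hypothetical impact factors and mean normalised accuracy ranks.
    impact_factor = [2.1, 4.8, 3.3, 9.7, 1.2]
    accuracy_rank = [0.40, 0.55, 0.35, 0.60, 0.50]
    print(spearman_rho(impact_factor, accuracy_rank))
```

Because only ranks matter, the log-scaled axes in the plots do not change ρ.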
Conclusions
Nothing appears to be predictive of accuracy¹
Fast software undergoes more developmental iterations
Can heuristic approaches produce a better result than mathematically
complete approaches?
It doesn’t appear to matter how famous you are, which journals you
publish in, whether you’re early or late, or how often your work is cited:
you can still write great software!
¹ There is still a chance I have screwed something up...
Thanks
Stephanie McGimpsey
Fatemeh Ashari Ghomi
Sinan Uğur Umu
Funded by: Rutherford Discovery Fellowship, BPRC and Biological Heritage: National Science Challenge.
