Microbial Genomics and
Bioinformatics
BM405
1. Introduction
Leighton Pritchard (1,2,3)
(1) Information and Computational Sciences,
(2) Centre for Human and Animal Pathogens in the Environment,
(3) Dundee Effector Consortium,
The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
Acceptable Use Policy
Recording of this talk, taking photos, discussing the content using
email, Twitter, blogs, etc. is permitted (and encouraged),
providing distraction to others is minimised.
These slides will be made available on SlideShare.
These slides, and supporting material including exercises, are
available at https://github.com/widdowquinn/Teaching-Strathclyde-BM405
Table of Contents
Introduction
A personal view
Erwinia carotovora subsp. atroseptica
Dickeya spp., Campylobacter spp., and Escherichia
coli
So what’s changed?
High Throughput Sequencing
Three revolutions, four dominant technologies
Benchmarking
Nanopore
How fast is sequence data increasing?
Sequence Data Formats
FASTQ
SAM/BAM/CRAM
Repositories
Assembly
Overlap-Layout-Consensus
de Bruijn graph assembly
Read Mapping
Short-Read Sequence Alignment
The Assembly
What you get back
Comparative Genomics
Computational Comparative Genomics
Bulk Genome Properties
Nucleotide Frequency/Genome Size
Whole Genome Alignment
An Introduction to Pairwise Genome Alignment
Average Nucleotide Identity
Whole Genome Alignment in Practice
Ordering Draft Genomes By Alignment
Chromosome painting
Nosocomial P. aeruginosa acquisition
Genome Features
What are genome features?
Prokaryotic CDS Prediction
Assessing Prediction Methods
Prokaryotic Annotation Pipelines
Genome-Scale Functional Annotation
Functional Annotation
A visit to the doctor
Statistics of genome-scale prediction
Building to Metabolism
Reconstructing metabolism
Equivalent Genome Features
What makes genome features equivalent?
Homology, Orthology, Paralogy
Who let the -logues out?
What’s so important about orthologues?
Evaluating orthologue prediction
Using orthologue predictions
Core and Pan-genomes
Conclusions
Things I Didn’t Get To
Conclusions
The impact [a]
[a] Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
Genome sequencing and bioinformatics have transformed our
understanding of prokaryotic biology:
• function
• evolution
• interactions
• community structure
• real-time monitoring and diagnostics
• as a platform for synthetic biology
It now takes much longer to analyse data than to generate it
The endpoints
• 2003: Erwinia carotovora subsp. atroseptica
• 2015: Dickeya spp., Campylobacter spp., and Escherichia coli
2003: E. carotovora subsp. atroseptica
• £250k collaboration between SCRI, University of
Cambridge, WT Sanger Institute
• Single isolate: E. carotovora subsp. atroseptica SCRI1043
• The first sequenced enterobacterial plant pathogen (32
authors!) [1]
• All repeats and gaps bridged and sequenced directly
• Result: a single, complete, high-quality 5Mbp circular
chromosome at 10.2X coverage: 106,500 reads
[1] Bell et al. (2004) Proc. Natl. Acad. Sci. USA 101(30):11105-11110. doi:10.1073/pnas.0402424101
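The coverage, genome size and read count quoted above are linked by the usual relation coverage = (number of reads × mean read length) / genome size. The sketch below rearranges this to estimate the mean read length implied by the slide's figures. It assumes the 5Mbp figure is exact and that all reads contribute equally; the function names are illustrative only.

```python
def coverage(n_reads: int, mean_read_length_bp: float, genome_size_bp: int) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


def mean_read_length(cov: float, genome_size_bp: int, n_reads: int) -> float:
    """Rearrangement of the coverage relation to solve for mean read length."""
    return cov * genome_size_bp / n_reads


# Figures quoted on the slide (genome size rounded to 5 Mbp)
implied = mean_read_length(10.2, 5_000_000, 106_500)
print(f"Implied mean read length: {implied:.0f} bp")  # about 479 bp
```

A mean read length of a few hundred bases is what would be expected for the capillary (Sanger) sequencing of that period.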
2003: E. carotovora subsp. atroseptica
A genome sequence is a starting point. . .
• Manual annotation by the Sanger Pathogen Sequencing Unit
• Literature searches and comparisons
• Six people, for six months ≈ three person-years
• Genes: BLAST, GLIMMER, ORPHEUS
• Functional domains: PFAM, SIGNALP, TMHMM
• Metabolism: KEGG
• ncRNA: RFAM
2003: E. carotovora subsp. atroseptica
Working (Eca_Sanger_annotation.gbk) and published
(NC_004547.gbk) annotation files are in the data directory
2003: E. carotovora subsp. atroseptica
Compared against all 142 available bacterial genomes [2]
[2] data/Pba directory in the accompanying GitHub repository
2013: Dickeya spp.
Sequenced and annotated 25 new isolates of Dickeya
• 25 Dickeya isolates, at least six species
• Multiple sequencing methods: 454, Illumina (SE, PE)
• Minor publications (6, 8 authors) [3,4]
• Results: 12-237 fragments containing 4.2-5.1Mbp, at 6-84X
coverage, 170k-4m reads
• Automated annotation: RAST with manual corrections
[3] Pritchard et al. (2013) Genome Ann. 1 (4) doi:10.1128/genomeA.00087-12
[4] Pritchard et al. (2013) Genome Ann. 1 (6) doi:10.1128/genomeA.00978-13
2013: Dickeya spp.
Within-genus comparisons: large-scale synteny and rearrangement
Within-species comparisons: e.g. indels, HGT
2013: Dickeya spp.
Within-genus comparisons: whole genome-based species
delineation [5]
[5] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
2013: Dickeya spp.
Within-genus comparisons: differences in metabolism
2014: E. coli
Sequenced and annotated ≈ 190 isolates of E. coli
All isolates were environmental, sampled from lysimeters
• Illumina paired-end sequencing. Total cost of sequencing
190 bacteria: ≈£11k
• Automated annotation: PROKKA
2014: E. coli
Sequencing output is variable, even with the same preps, the “same”
bacteria, and similar sources.
• Results: 5-3000 contigs (median ≈ 125); 9kbp-7.1Mbp
(median ≈ 5Mbp); 170k-4m reads
2014: E. coli
Genome sequencing enables within-species classification
[Figure: ANIm heatmap of pairwise comparisons between isolate assemblies (rows and columns labelled by assembly, e.g. Lys142_contigs, AW3_contigs, X5038_contigs, DSM10973_contigs, Brunei20070942_contigs). Colour key and histogram: ANIm value (0.90 to 0.98) against count. Group labels: A, B1, B2, C, D, E, F, U, X.]
2014: Campylobacter spp.
Sequenced ≈ 1034 isolates of Campylobacter
Clinical, animal, food-associated isolates
• Illumina paired-end sequencing. Total cost of sequencing
>1000 bacteria: ≈£60k
• Automated annotation: PRODIGAL
2014: Campylobacter spp.
• Identified 15554 gene families from genecalls.
• Calculating these took 23 days on the institute cluster (4e12 pairwise
protein comparisons!).
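To see where a number like 4e12 comes from, the arithmetic can be run backwards. The sketch below assumes the comparison was all-vs-all over ordered pairs (N × N for N proteins); that assumption, and the derived per-genome and per-second figures, are not stated on the slide.

```python
import math

N_ISOLATES = 1034        # isolates sequenced (from the slide)
N_COMPARISONS = 4e12     # pairwise protein comparisons (from the slide)
RUN_DAYS = 23            # wall-clock time on the cluster (from the slide)

# Assumption: all-vs-all over ordered pairs, so comparisons = n_proteins ** 2
n_proteins = math.sqrt(N_COMPARISONS)
proteins_per_genome = n_proteins / N_ISOLATES
comparisons_per_second = N_COMPARISONS / (RUN_DAYS * 24 * 3600)

print(f"Total proteins compared: {n_proteins:.2e}")
print(f"Implied proteins per genome: {proteins_per_genome:.0f}")
print(f"Sustained rate: {comparisons_per_second:.2e} comparisons/s")
```

The quadratic growth is the point: doubling the number of isolates would quadruple the number of comparisons.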
So what’s changed?
• Cost: £250k per genome, to £60 per genome.
Now cheaper to sequence a genome than to analyse it!
• Location: sequencing centre, to benchtop
• Data: volume has increased massively - what you get back
from machines, and what’s out there to work with
More data is better, but also more challenging.
• Speed: typical sequencing run time can be less than a day
• Software: more software to do more things (but not always
better. . .)
• New kinds of experiment: genomes, exomes, variant calling,
methylated sequences, . . .
• New kinds of application: diagnostics, epidemic tracking,
metagenomics, . . .
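The "£250k to £60 per genome" comparison in the list above can be checked against the project figures quoted earlier in the talk. This is a rough sketch: the 2003 figure covers the whole collaboration rather than sequencing alone, and the isolate counts are the approximate values given on the slides.

```python
# (total cost in GBP, number of genomes), as quoted on the slides
projects = {
    "E. carotovora subsp. atroseptica (2003)": (250_000, 1),
    "E. coli (2014)": (11_000, 190),
    "Campylobacter spp. (2014)": (60_000, 1034),
}

cost_per_genome = {name: cost / n for name, (cost, n) in projects.items()}

for name, cost in cost_per_genome.items():
    print(f"{name}: about £{cost:,.0f} per genome")

# Fold reduction between the 2003 project and the 2014 Campylobacter project
fold = (cost_per_genome["E. carotovora subsp. atroseptica (2003)"]
        / cost_per_genome["Campylobacter spp. (2014)"])
print(f"Cost per genome fell roughly {fold:,.0f}-fold")
```

Both 2014 projects come out at just under £60 per genome, consistent with the headline figure.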
So what’s changed?
Having a single genome is useful, but having thousands really helps
comparative genomics:
combining genomic data, evolutionary and comparative biology
• Transfer functional understanding of model systems (e.g. E.
coli) to non-model organisms
• Genomic differences may underpin phenotypic (host range,
virulence, physiological) differences
• Genome comparisons aid identification of functional elements
on the genome
• Studying genomic changes reveals evolutionary processes and
constraints
Revolutions One and Two [a]
[a] Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
Revolution One: whole-genome shotgun
• First bacterial genomes: Haemophilus influenzae (1995); E.
coli, Bacillus subtilis (1997)
• (Oh, and the human genome)
Revolution Two: high-throughput sequencing
• “Next-generation” sequencing (now “last-generation”).
• 454 GS20 (2005), Illumina GAII (2007).
• metagenomics; surveillance sequencing; SNP-based
comparisons; transposon-sequencing for functional genomics;
ChIP-seq; . . .
Not all HT sequencing is the same
It’s all about the biology, but it all starts with the data.
Sequencing technology (including library prep.) affects your
sequence data.
• Roche/454
• Illumina
• Ion Torrent
• Pacific Biosciences (PacBio)
The basic principle
DNA source is fragmented, and the fragments are sequenced.
HTS: PE vs SE
In high-throughput sequencing (e.g. Illumina), reads may be
single-end or paired-end.
Putting the jigsaw back together is sequence assembly.
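The fragment-then-read principle, and the difference between single-end and paired-end reads, can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (error-free reads, fixed fragment and read lengths, a random "genome"); all names are hypothetical and it does not model any particular platform.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def random_fragment(genome: str, fragment_len: int, rng: random.Random) -> str:
    """Pick one fragment of fixed length from a random position."""
    start = rng.randrange(0, len(genome) - fragment_len + 1)
    return genome[start:start + fragment_len]


def single_end_read(fragment: str, read_len: int) -> str:
    """Single-end: read from one end of the fragment only."""
    return fragment[:read_len]


def paired_end_reads(fragment: str, read_len: int) -> tuple[str, str]:
    """Paired-end: read inwards from both ends; the second read is on the
    opposite strand, so it is the reverse complement of the fragment's end."""
    return fragment[:read_len], reverse_complement(fragment)[:read_len]


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
fragment = random_fragment(genome, 300, rng)
read1, read2 = paired_end_reads(fragment, 100)
print(read1)
print(read2)
```

Because both reads of a pair come from the same fragment, their approximate separation is known, which is what makes paired-end data useful to an assembler.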
Four different chemistries [a]
[a] Loman et al. (2012) Nat. Rev. Micro. 31:294-296 doi:10.1038/nbt.2522
Reads differ by technology, and may require different bioinformatic
treatment. . .
• Roche/454: Pyrosequencing (long reads, but expensive, with
high homopolymer error rates) (700-800bp, 0.7Gbp, 23h)
• Illumina: Reversible terminator (cost-effective, massive
throughput, but short read lengths) (2x150bp, 1.5Gbp, 27h)
• Ion Torrent: Proton detection (short run times, good
throughput, high homopolymer error rates) (200bp, 1Gbp, 3h)
• PacBio: Real-time sequencing (very long reads, high error
rate, expensive) (3-15kbp, 3Gbp/day, 20min)
. . . different error profiles, varying capability to assemble/determine
variation
Costs of sequencing [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
Benchmarked performance
Apply several sequencing technologies to the same sample(s).
Benchmark comparisons inform appropriate choice of sequencing
technology [6-12]
Progress in technologies is driving research very rapidly.
Always look for most recent/relevant benchmarks.
Bioinformatic methods also need to be benchmarked.
[6] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
[7] Salipante et al. (2014) Appl. Environ. Micro. 80:7583-7591 doi:10.1128/AEM.02206-14
[8] Frey et al. (2014) BMC Genomics 15:96 doi:10.1186/1471-2164-15-96
[9] Koshimizu et al. (2013) PLoS One 8:e74167 doi:10.1371/journal.pone.0074167
[10] Quail et al. (2012) BMC Genomics 13:341 doi:10.1186/1471-2164-13-341
[11] Loman et al. (2012) Nat. Biotech. 30:434-439 doi:10.1038/nbt.2198
[12] Lam et al. (2011) Nat. Biotech. 1 (6) doi:10.1038/nbt.2065
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
• Sequenced Vibrio parahaemolyticus (2x chromosomes, closed
reference genome) with four technologies
• Chose an assembler for each tech, and assembled reads
• Excess reads with Ion/MiSeq: used random subsets of reads
to determine required coverage
• Aligned assemblies (MUMmer) to known high-quality
chromosome sequence, to measure error
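The random-subset step in the list above can be sketched as follows: shuffle the reads, then keep adding reads until the subset reaches a target depth of coverage. This is an illustrative sketch, not the procedure used by Miyamoto et al.; the function name and the toy data are hypothetical.

```python
import random


def subsample_to_coverage(reads, genome_size_bp, target_coverage, seed=0):
    """Return a random subset of reads whose total length just reaches
    target_coverage x genome_size_bp (or all reads if there are too few)."""
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    target_bases = target_coverage * genome_size_bp
    subset, total_bases = [], 0
    for read in shuffled:
        if total_bases >= target_bases:
            break
        subset.append(read)
        total_bases += len(read)
    return subset


# Toy data: 1000 reads of 100 bp over a 1 kbp "genome" is 100x coverage
reads = ["A" * 100 for _ in range(1000)]
subset = subsample_to_coverage(reads, genome_size_bp=1_000, target_coverage=60)
achieved = sum(len(r) for r in subset) / 1_000
print(f"{len(subset)} reads kept, {achieved:.0f}x coverage")
```

Repeating this at several target coverages, assembling each subset, and comparing the assemblies to the reference is one way to find the depth beyond which more reads stop helping.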
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
De novo assembly and alignment against Vibrio parahaemolyticus
(2x chromosomes)
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
• More and longer reads do not always give the best assemblies:
read depth, read distribution, and error rate also matter
• Optimal assemblies were obtained at around 60x-80x
coverage, for Illumina and Ion.
• Multiple rRNA regions are fragmented in short-read assemblies
• PacBio generated single chromosome contigs
• Assembly of multiple-chromosome bacteria is currently feasible
Published genomes vary because methods are not standardised (e.g.
sequencing technology, assembler, parameter settings and
pre-processing). . .
Revolution Three [a]
[a] Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
Revolution Three: single-molecule long-read sequencing
• Living through the revolution, now
• PacBio (SMRT): large machine, expensive
• Nanopore: portable device, inexpensive
• Less mature, less accurate, improving rapidly
The future dominant sequencer?
Oxford Nanopore. A sequencer the size of your hand.
• Microfluidics, single-molecule sequencing; 11-70kbp reads
• Reports current across pore (tiny electron microscope) as
molecule moves through
• $10/Mbp, 110Mbp per flowcell [13]
[13] Yaniv Erlich (2013) Future Continuous blog
Early data [a]
[a] Quick et al. (2014) GigaScience 3:22 doi:10.1111/1755-0998.12324
It’s a fast-moving area, and results are improving.
Developing tools
Oxford Nanopore’s open beta went out without analysis tools.
Tools (Poretools, poRe, etc.) were written/tested/validated by the
user community [14,15]
[14] Loman and Quinlan (2014) Bioinformatics doi:10.1093/bioinformatics/btu555
[15] Watson et al. (2014) Bioinformatics doi:10.1093/bioinformatics/btu590
Recent applications
• Amplicon sequencing (16S metagenomics) of bacteria and
viruses [16]
• Real-time viral diagnostics [17]
• Scaffolding of a bacterial genome [18]
• Complete de novo assembly of a bacterial genome [19]
[16] Kilianski et al. (2015) GigaScience doi:10.1186/s13742-015-0051-z
[17] Greninger et al. (2015) Genome Med. doi:10.1186/s13073-015-0220-9
[18] Karlsson et al. (2015) Sci. Reports doi:10.1038/srep11996
[19] Loman et al. (2015) Nat. Meth. doi:10.1038/nmeth.3444
The three revolutions [a]
[a] Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
Predicting the future is hard. . .
“How many genomes will we have, and when?”
Su et al. attempted to answer this [20]:
[20] http://sulab.org/2013/06/sequenced-genomes-per-year/
After that, the flood. . .
High-throughput sequencing methods have completely changed the
landscape of microbiology
(Nearly) complete, (mainly) accurate sequence data is now
inexpensive (and cheaper than analysis)
• GOLD (19/2/2014): 3,011 “finished” ; 9,891 “permanent
draft” genomes
• GOLD (10/11/2015): 7,657 “finished” ; 27,438 “permanent
draft” genomes; 50,673 prokaryotes
• NCBI WGS (19/2/2014): 17,023 microbial genomes
• NCBI Genome (10/11/2015): 55,033 prokaryotic genomes
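The two GOLD snapshots above allow a rough estimate of how quickly the number of genomes is growing. The sketch assumes the dates are in day/month/year order and that growth between the two snapshots was exponential; both are assumptions, and two data points cannot confirm the growth model. The NCBI figures are left out because they come from two different databases.

```python
import math
from datetime import date


def doubling_time_days(count_start: int, count_end: int, days: int) -> float:
    """Doubling time implied by two counts, assuming exponential growth."""
    return days * math.log(2) / math.log(count_end / count_start)


start, end = date(2014, 2, 19), date(2015, 11, 10)
elapsed_days = (end - start).days

# GOLD counts from the slide: (first snapshot, second snapshot)
gold = {
    "finished": (3_011, 7_657),
    "permanent draft": (9_891, 27_438),
}

doubling = {}
for kind, (first, second) in gold.items():
    doubling[kind] = doubling_time_days(first, second, elapsed_days)
    print(f"{kind}: x{second / first:.2f} in {elapsed_days} days, "
          f"doubling every ~{doubling[kind]:.0f} days")
```

Under these assumptions both categories double in a little over a year.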
Pseudomonas
In 2011, 25 isolate sequences [21]; in 2015, 2098 genomes:
We’re going to need bigger bioinformatics. . .
[21] Studholme (2011) Mol. Plant Pathol. doi:10.1111/j.1364-3703.2011.00713.x
Microbial Genomics and
Bioinformatics
BM405
2. Assembly
Leighton Pritchard (1,2,3)
(1) Information and Computational Sciences,
(2) Centre for Human and Animal Pathogens in the Environment,
(3) Dundee Effector Consortium,
The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
What do you get from sequencing
Sequence reads. Usually lots of them.
Size/number/errors depend on technology used.
[22] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
Sequence Read Data Formats
Two common sequence read data formats:
• FASTQ: Related to FASTA, a de facto standard for sequence
reads
• SAM/BAM: Sequence alignment/mapping format, two
flavours - uncompressed and compressed
New formats are required to handle very large numbers of genomes
• CRAM: Reference-based sequence compression
You might also receive assembled genomes directly from a
sequencing partner
FASTQ [a]
[a] Cock et al. (2009) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137
@HISEQ2500-09:168:HA424ADXX:2:1101:1404:2061 1:N:0:ATCTCTCTCACCAACT
CGGTCTTGGGATAGATGGGTTGCAGGTTGCGGTAAAGCTCGGACTCCAGAGCGTCCAGGGTAGACTGGCTAATCTTCTGCTCTTTATCGATCATTATTTC
+
@@CBDDFFHHDFDHEGHIICGIFHHIIIIFHGGHIEHHIIIIGHGHIIIIIGGHHFFFFC@CBCCCDDBDCDDDDDDDDCCDDDD3@ABDDDDDEEEDE@
Files typically have a .fq or .fastq extension.
Four lines per sequence
1. Header: sequence identifier and optional description, starts
with “@”
2. Raw sequence ([ACGTN])
3. Separator line: starts with “+”, optionally followed by a repeat of the line 1 header
4. Quality scores, numbers encoded as ASCII
Qphred = −10 log10(e), where e is the estimated probability
that a base call is incorrect (a logarithmic scale, like pH).
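The four-line layout and the Phred relation can be made concrete with a small reader. This is a minimal sketch that assumes strictly four lines per record (no wrapped sequence lines) and Sanger-style encoding with an ASCII offset of 33; the function names and the toy record are illustrative, and for real work an established parser is the safer choice.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch: {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Invert Q = -10 log10(e) to recover the error probability e."""
    return 10 ** (-q / 10)


toy = io.StringIO("@read1 toy example\nACGTN\n+\nII5+!\n")
records = list(read_fastq(toy))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, seq, scores)                 # read1 toy example ACGTN [40, 40, 20, 10, 0]
print(error_probability(20))             # 0.01
```

A score of 20 therefore means a 1 in 100 chance that the base call is wrong, and 40 means 1 in 10,000.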
Quality Control
The quality of basecalls (error rate) varies between and along reads.
(real data from our E. coli sequencing: good quality)
Quality Control
Some datasets are better than others.
Reads can be trimmed, or discarded.
Including poor reads compromises assembly.
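Trimming and discarding can be sketched as two small functions: cut low-quality bases from the 3' end of a read, then drop the read if too little remains. This is a deliberately naive illustration, not the algorithm of any particular trimming tool; the thresholds (Q20, 50bp) and the offset of 33 are assumptions chosen for the example.

```python
def trim_3prime(sequence: str, quality: str, min_q: int = 20, offset: int = 33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence: str, min_length: int = 50) -> bool:
    """Decide whether a trimmed read is still long enough to be useful."""
    return len(sequence) >= min_length


# Toy read: the last two bases have scores 10 ('+') and 0 ('!')
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5+!")
print(trimmed_seq, trimmed_qual)      # ACGTAC IIIII5
print(keep_read(trimmed_seq))         # False: shorter than 50 bp
```

Real trimmers usually work on sliding windows or running sums rather than single bases, so that one good base near the end does not shield a run of poor ones.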
FASTQ encoding [a]
[a] Cock et al. (2009) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137
There is more than one version of FASTQ; versions differ by quality encoding.
Numbers converted to ASCII start at different values.
Versions vary by sequencer and period.
Most now settled on Sanger format (occasionally see historical
data).
Quality scores (Qphred ) offset to lie in the given range:
1. Sanger: 33-126, used in SAM/BAM, and Illumina 1.8+
2. Illumina 1.0-1.2: 59-126
3. Illumina 1.3-1.8: 64-126
Knowing where your data comes from, and the data format
and version, is always important.
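Because the ranges overlap, the encoding can only sometimes be inferred from the data. The sketch below is a heuristic, assuming the ASCII ranges listed above: it looks at the lowest character seen in the quality lines. The function name and return strings are hypothetical, and a real check should inspect many reads.

```python
def guess_quality_encoding(quality_lines):
    """Guess the FASTQ quality encoding from the lowest ASCII code observed."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 33:
        raise ValueError("character below ASCII 33: not a valid quality string")
    if lowest < 59:
        # Only the Sanger range extends below ASCII 59
        return "Sanger / Illumina 1.8+ (offset 33)"
    if lowest < 64:
        # ASCII 59-63 can occur in Solexa-scaled data but not in Phred+64 data
        return "Solexa / Illumina 1.0-1.2 (offset 64)"
    # Everything is >= 64: either Phred+64, or Sanger data with only high scores
    return "ambiguous: offset 64, or high-quality Sanger (offset 33)"

# Excerpt of the quality line from the example read; '3' is ASCII 51
print(guess_quality_encoding(["CCDDDD3@ABDDDDDEEEDE@"]))
```

The ambiguous case is the reason the provenance of the data matters: no amount of inspection can distinguish the encodings if only high-valued characters are present.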
SAM
https://github.com/samtools/hts-specs
Intended to represent read alignments, also used for raw reads.
Tab-delimited plain text. Header lines (optional) start with “@”.
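To make the layout concrete, here is a minimal Python sketch that separates header lines from alignment lines and names the eleven mandatory columns defined in the SAM specification. The example lines are invented, and a real project would more likely use an established library than a hand-written parser.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

def parse_sam(lines):
    """Split SAM text into header lines and alignment records (dicts)."""
    headers, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)          # header lines start with "@"
            continue
        columns = line.split("\t")        # alignment lines are tab-delimited
        record = dict(zip(SAM_FIELDS, columns[:11]))
        for key in INT_FIELDS:
            record[key] = int(record[key])
        record["TAGS"] = columns[11:]     # optional TAG:TYPE:VALUE fields
        records.append(record)
    return headers, records

# Hypothetical SAM content: two header lines and one aligned read
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t60\t10M\t*\t0\t0\tCGGTCTTGGG\t@@CBDDFFHH\tNM:i:0",
]
headers, records = parse_sam(example)
print(len(headers), records[0]["RNAME"], records[0]["POS"], records[0]["CIGAR"])
```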
BAM/CRAM
BAM: https://github.com/samtools/hts-specs
CRAM: http://www.ebi.ac.uk/ena/software/cram-toolkit
BAM is a compressed, binary version of SAM.
• BGZF compression.
• Random access within the compressed file, through indexing.
CRAM format may come to dominate, especially in archives, as datasets get larger:
• Reference-based compression. [23]
• Highly suited to compression and archiving of very large amounts of sequence data. [24]
[23] Fritz et al. (2011) Genome Res. 21:734-740 doi:10.1101/gr.114819.110
[24] Cochrane et al. (2012) GigaScience 1:2 doi:10.1186/2047-217X-1-2
Read repositories
Repositories are centrally-maintained locations that keep sequence read data from multiple projects.
Submission to a repository is a requirement for publication, and the right thing to do!
• ENA: The European Nucleotide Archive (http://www.ebi.ac.uk/ena), maintained by EMBL-EBI
• SRA: The Sequence Read Archive, formerly the Short Read Archive (http://www.ncbi.nlm.nih.gov/sra), maintained in the US by NCBI
Sequence Assembly
Once you have reads, you can assemble a genome.
Two main approaches to read assembly:
• Overlap-Layout-Consensus: Typically used with smaller sets of longer reads (e.g. 454, PacBio, Ion Torrent, Nanopore)
• de Bruijn assembly: Typically used with many, shorter reads (e.g. Illumina), but also useful for longer reads
See e.g. Leland Taylor’s thesis
(http://gcat.davidson.edu/phast/docs/Thesis PHAST LelandTaylor.pdf),
and PHAST (http://gcat.davidson.edu/phast/index.html).
Overlap-Layout-Consensus
The oldest approach, originally used with smaller sets of longer reads.
Can be time consuming (all-vs-all read comparisons), but this is offset by graph-based OLC algorithms (e.g. SGA).
Now more important again, with long-read data.
• Celera Assembler [25]
• Newbler (the Roche/454 GS assembler) [26]
• String Graph Assembler (SGA) [27]
[25] http://wgs-assembler.sourceforge.net/
[26] http://www.454.com/products/analysis-software/
[27] Simpson and Durbin (2012) Genome Res. 22:549-556 doi:10.1101/gr.126953.111
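The all-vs-all overlap step is what makes naive OLC expensive. The toy Python sketch below computes exact suffix-prefix overlaps between every ordered pair of reads, then merges reads along a known path. It assumes error-free reads on one strand, uses invented reads, and is far simpler than any of the assemblers listed above.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate overlap start within a
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def overlap_graph(reads, min_len=3):
    """Overlap step: compare every ordered pair of reads (all-vs-all)."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_len)
        if olen:
            graph[(a, b)] = olen
    return graph

def merge(a, b, olen):
    """Consensus for two exactly-overlapping reads: append the new part of b."""
    return a + b[olen:]

# Hypothetical error-free reads, already in layout order
reads = ["ACGTTG", "TTGCAT", "CATGG"]
graph = overlap_graph(reads)

contig = reads[0]
for previous, current in zip(reads, reads[1:]):
    contig = merge(contig, current, graph[(previous, current)])
print(graph)
print(contig)
```

The number of pairwise comparisons grows with the square of the number of reads, which is why the approach suits smaller sets of longer reads.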
de Bruijn graph assembly
Used for short reads (e.g. Illumina):
k-mer based graph (choice of k important):
de Bruijn graph assembly
k-mer based genome and read graphs [28]
“True” edges = genome; “Error” edges = wrong assembly
[28] Chaisson et al. (2009) Genome Res. 19:336-346 doi:10.1101/gr.079053.108
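The idea of "true" and "error" edges can be illustrated with a toy example. In this sketch each k-mer becomes an edge between its (k-1)-mer prefix and suffix. The genome, reads and choice of k = 4 are invented for illustration, and real assemblers add many refinements such as both strands, k-mer counts and graph simplification.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def de_bruijn_edges(seqs, k):
    """Edges of a de Bruijn graph: (k-1)-mer prefix -> (k-1)-mer suffix."""
    edges = set()
    for seq in seqs:
        for kmer in kmers(seq, k):
            edges.add((kmer[:-1], kmer[1:]))
    return edges

# Hypothetical genome, two error-free reads, and one read with a basecall error
genome = "ATGGCGTGCA"
good_reads = ["ATGGCGT", "GCGTGCA"]
bad_read = "ATGGAGT"   # C -> A error at the fifth base

true_edges = de_bruijn_edges([genome], 4)
read_edges = de_bruijn_edges(good_reads + [bad_read], 4)
error_edges = read_edges - true_edges

print(len(true_edges), "true edges")
print(sorted(error_edges))
```

Here a single miscalled base adds three edges that do not exist in the genome graph, because it is contained in three distinct new k-mers.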
de Bruijn graph assembly
All sequencing technologies have basecall errors.
• The proportion of errors is approximately constant per read
• Basecall errors lead to edge errors
• The more reads you have, the more errors there are
Increased coverage does not ensure increased accuracy [29]
[29] Conway and Bromage (2011) Bioinformatics 27:479-486 doi:10.1093/bioinformatics/btq697
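A back-of-envelope calculation shows why more reads means more error edges. The sketch assumes a constant per-base error rate and isolated errors, each of which can create up to k erroneous k-mers. All numbers are hypothetical, and the result is an upper bound rather than a prediction.

```python
def expected_error_kmers(genome_size, coverage, read_len, error_rate, k):
    """Upper bound on erroneous k-mers, assuming isolated, independent errors."""
    n_reads = coverage * genome_size / read_len
    n_errors = n_reads * read_len * error_rate   # expected miscalled bases
    return n_errors * k                          # each error affects up to k k-mers

# Hypothetical 5 Mbp genome, 100 nt reads, 1% error rate, k = 31
genome_size, read_len, error_rate, k = 5_000_000, 100, 0.01, 31
for coverage in (25, 50, 100):
    bound = expected_error_kmers(genome_size, coverage, read_len, error_rate, k)
    # True k-mers are capped at about the genome size; error k-mers keep growing
    print(coverage, f"{bound:.3g}", "error k-mers vs at most", genome_size, "true k-mers")
```

The number of true k-mers is bounded by the genome size, whereas the number of erroneous k-mers scales linearly with coverage.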
de Bruijn graph assembly
Fast, and scales well to large datasets, as it never computes
all-against-all overlaps.
Sensitive to sequencing errors, but resolves short repeats (graph
bulges and whirls).
Notable tools:
• Velvet [30]
• CLC Assembly Cell [31]
• Cortex [32]
[30] Zerbino and Birney (2008) Genome Res. 18:821-829 doi:10.1101/gr.074492.107
[31] http://www.clcbio.com/products/clc-assembly-cell/
[32] Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028
“Coloured” de Bruijn graph assemblies
Cortex [33] allows for on-the-fly identification of complex variation, and genotyping, by tracking “coloured” edges in the graph.
Colours ≈ different isolates/organisms (e.g. a reference)
[33] Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028
Why map reads?
Trapnell and Salzberg (2009) Nat. Biotechnol. 27:455-457 doi:10.1038/nbt0509-455
“Resequencing” an organism (sequencing a close relative, looking
for SNPs/indels)
RNA-seq, ChIP-seq, etc. - coverage ≈ expression/binding
To see where reads map on an assembled genome
• Is coverage even? (can indicate repeats)
• Are there SNPs/indels? (heterogeneous population)
• Assembly problems?
Short-Read Sequence Alignment
Trapnell and Salzberg (2009) Nat. Biotechnol. 27:455-457 doi:10.1038/nbt0509-455
An embarrassment of tools (over 60 listed on Wikipedia)
Main approaches:
• Alignment: Smith-Waterman is mathematically guaranteed to find the optimal local alignment for a given scoring scheme (e.g. BFAST, MOSAIK); approximation to S-W (e.g. BLAST); ungapped or gapped alignment (e.g. MAQ, FAST, mrFAST, SOAP). Can be slow.
• Burrows-Wheeler Transform: Makes a reusable index of the genome (e.g. Bowtie, BWA), can be extended to consider sequence probability (e.g. BWA-PSSM). Can be very fast.
Other tools may employ different algorithms, some designed to be parallelised on GPUs/FPGAs (e.g. NextGenMap, XpressAlign)
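The "reusable index" idea behind Burrows-Wheeler aligners can be sketched as follows. This toy Python version builds the suffix array and BWT naively and performs exact-match backward search. It is an illustration only: the reference and read are invented, and real tools such as Bowtie and BWA use compressed indexes and tolerate mismatches and gaps.

```python
from collections import Counter

def build_index(reference):
    """Naive suffix array and Burrows-Wheeler transform of reference + '$'."""
    text = reference + "$"                 # '$' sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt

def backward_search(suffix_array, bwt, pattern):
    """Exact-match positions of pattern, processed from its last character."""
    counts = Counter(bwt)
    first, total = {}, 0
    for char in sorted(counts):            # first row index of each character
        first[char] = total
        total += counts[char]
    low, high = 0, len(bwt)                # current range of matching suffixes
    for char in reversed(pattern):
        if char not in first:
            return []
        low = first[char] + bwt[:low].count(char)
        high = first[char] + bwt[:high].count(char)
        if low >= high:
            return []
    return sorted(suffix_array[i] for i in range(low, high))

# Hypothetical reference and read; the index is built once and reused per read
reference = "ACGTACGTTACG"
suffix_array, bwt = build_index(reference)
print(backward_search(suffix_array, bwt, "ACG"))
```

The index depends only on the reference, so the cost of building it is paid once and shared across all reads.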
Visualising Read Mapping
Several tools available, e.g. Tablet (the best. . .) [34]
[34] Milne et al. (2013) Brief. Bioinf. 14:193-202 doi:10.1093/bib/bbs012
In an ideal world
Ideally, you would have one sequence per chromosome/plasmid (and no errors): a closed/complete genome.
PacBio, Sanger, manual closing, Nanopore(?)
More realistically. . .
Typically, a number of assembled fragments (contigs or scaffolds)
are returned in FASTA format: a draft, disordered genome.
Around 250 contigs for a 5Mbp genome is usual with Illumina
Ordering contigs
Contigs can be ordered correctly into scaffolds if paired-end reads
span gaps, or long reads are available (typically done during
assembly).
Gaps are usually filled with Ns (length estimated)
Ordering contigs
Contigs and scaffolds can also be reordered by alignment to a
reference genome.
• Mauve/progressiveMauve [35]
• MUMmer [36]
[35] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
[36] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
Where next?
Lefebure et al. (2010) Genome Biol. Evol. 2:646-655 doi:10.1093/gbe/evq048
Microbial Genomics and Bioinformatics
BM405
3. Whole Genome Comparisons
Leighton Pritchard
The Power of Comparative Genomics
Massively enabled by high-throughput sequencing, and the
availability of thousands of sequenced isolates.
Computational comparisons more powerful and precise than
experimental comparative genomics: the ultimate microbial
typing solution
Three broad areas/scales:
• Comparison of bulk genome properties
• Whole genome sequence comparisons
• Comparison of features/functional components
Nucleotide frequency/genome size
• Very easy to calculate from complete/draft genome
• Can calculate for individual contigs/scaffolds/regions
• Usually reported in GUI genome browsers
Trivial to determine using, e.g. Python
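As a minimal sketch of how little code this takes, the Python below reads FASTA-formatted lines and reports the length and GC content of each contig, plus the total assembly size. The contigs are invented, and in practice the lines would come from an open file handle. Ambiguous bases (N) are excluded from the GC denominator, which is a choice made for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

# Hypothetical draft assembly with two contigs
fasta_lines = [">contig1", "ACGT", "GGCC", ">contig2", "ATATNN"]
contigs = list(read_fasta(fasta_lines))

for name, seq in contigs:
    print(name, len(seq), gc_content(seq))
genome_size = sum(len(seq) for _, seq in contigs)
print("total size:", genome_size)
```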
Nucleotide frequency/genome size
GC content and chromosome size can be characteristic
See data/bacteria size for example iPython notebook exercise
Blobology
Kumar and Blaxter (2011) Symbiosis 55:119-126 doi:10.1007/s13199-012-0154-6
Sequencing samples may be contaminated or contain microbial symbionts.
Expect more host than symbiont/contaminant DNA
GC content and read coverage can be used to separate contigs, following assembly and mapping
http://nematodes.org/bioinformatics/blobology/
k-mers
• Nucleotides: [ACGT]
• Dinucleotides: [AA|AC|AG|AT|CA|CC|. . .] (16 dimers)
• Trinucleotides: [AAA|AAC|AAG|AAT|ACA|. . .] (64 trimers)
• k-mers: 4^k possible k-mers
(see example in data/shiny)
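Counting k-mers and turning the counts into a frequency vector can be sketched as below. This is a single-strand illustration with an invented sequence. The vector always has 4^k entries, including k-mers that never occur.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k):
    """Frequency of every possible k-mer over ACGT (4**k entries)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)  # ignores k-mers containing N
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}

# Hypothetical short sequence
sequence = "ACGTACGT"
for k in (1, 2, 3):
    freqs = kmer_frequencies(sequence, k)
    print(k, len(freqs), "possible k-mers")
print(kmer_frequencies(sequence, 2)["AC"])
```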
k-mers
GC content = point value; k-mer frequencies = vector (list)
Diagnostic differences in k-mer frequency, and variability.
The basis of several comparison tools
[Figure panels: E. coli; Mycoplasma spp.]
What to align, and why?
To be useful, aligned genomes should:
• derive from a sufficiently recent common ancestor, so
homologous regions can be identified
• derive from a sufficiently distant common ancestor, so that
there are “interesting” differences to be identified
• help to answer your biological question
How to align, and why?
Naive sequence aligners (Needleman-Wunsch, Smith-Waterman)
are not appropriate for genome alignment
• Computationally expensive on large sequences
• Cannot handle rearrangements
Very many alternative alignment algorithms proposed
• megaBLAST http://www.ncbi.nlm.nih.gov/blast/html/megablast.html
• MUMmer http://mummer.sourceforge.net/
• BLAT http://genome.ucsc.edu/goldenPath/help/blatSpec.html
• LASTZ http://www.bx.psu.edu/~rsharris/lastz/
• LAGAN http://lagan.stanford.edu/lagan_web/index.shtml
• and many, many more. . .
Example exercises in data/whole_genome_alignment.
megaBLAST
Optimised for speed, over BLASTN [37]
• Genome-level searches
• Queries on large sequence sets
• Long alignments of very similar sequence
Uses the greedy algorithm of Zhang et al. [38], not the original BLAST algorithm.
• Concatenates queries (“query packing”) to improve performance
• Two modes: megaBLAST, and discontiguous megaBLAST (dc-megablast) for divergent sequences
BLASTN now uses the megaBLAST algorithm by default
[37] http://www.ncbi.nlm.nih.gov/blast/Why.shtml
[38] Zhang et al. (2000) J. Comp. Biol. 7:203-214 doi:10.1089/10665270050081478
BLAST vs megaBLAST
megaBLAST is faster, but does it give the same biological results?
megaBLAST (top) and BLAST (bottom) pairwise comparisons:
BLAST vs megaBLAST
Filter out weak matches - not quite identical:
MUMmer
Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
Uses suffix trees for pattern matching: very fast even for large sequences
• Finds maximal exact matches
• Memory use depends only on the reference sequence size
Suffix trees (http://en.wikipedia.org/wiki/Suffix_tree):
• Can be built and searched in O(n) time
• But useful algorithms are nontrivial
The MUMmer algorithm
Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
1. Identify a non-overlapping subset of maximal exact matches:
often Maximal Unique Matches (MUMs)
2. Cluster into alignment anchors
3. Extend between anchors to produce the final alignment
This is the basis of a very flexible suite of programs that align
different kinds of sequence: mummer, nucmer, promer
• nucleotide and (more sensitive) “conceptual protein”
alignments
• used for genome comparisons, assembly scaffolding, repeat
detection, . . .
• the basis of other aligners/assemblers (e.g. Mugsy, AMOS)
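Step 1 of the algorithm can be approximated without a suffix tree. The sketch below is a deliberate simplification, not MUMmer's method: it seeds on k-mers that occur exactly once in each sequence and extends each seed in both directions until the sequences disagree. Any match containing such a seed is itself unique in both sequences, so every result is a maximal unique match, but MUMs that contain no unique k-mer are missed. The sequences, function names and forward-strand-only comparison are all assumptions made for illustration.

```python
from collections import defaultdict

def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its position."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}

def seeded_unique_matches(ref, qry, k):
    """Maximal exact matches grown from k-mers unique in both sequences.

    Returns sorted (ref_start, qry_start, length) tuples.
    """
    ref_unique = unique_kmer_positions(ref, k)
    qry_unique = unique_kmer_positions(qry, k)
    matches = set()
    for kmer, r in ref_unique.items():
        if kmer not in qry_unique:
            continue
        q = qry_unique[kmer]
        r_start, q_start = r, q
        while r_start > 0 and q_start > 0 and ref[r_start - 1] == qry[q_start - 1]:
            r_start -= 1                  # extend the match to the left
            q_start -= 1
        r_end, q_end = r + k, q + k
        while r_end < len(ref) and q_end < len(qry) and ref[r_end] == qry[q_end]:
            r_end += 1                    # extend the match to the right
            q_end += 1
        matches.add((r_start, q_start, r_end - r_start))
    return sorted(matches)

# Hypothetical reference and query sharing the region ACGTCA
reference = "GGACGTCATT"
query = "CACGTCAGG"
anchors = seeded_unique_matches(reference, query, 4)
print(anchors)
```

Matches like these would then be clustered into anchors and the gaps between them aligned, as in steps 2 and 3 above.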
MUMmer vs megaBLAST
MUMmer identifies fewer weak matches
megaBLAST (top) and MUMmer (bottom) pairwise comparisons:
MUMmer vs megaBLAST
Filter out weak BLAST matches - not quite identical:
DNA-DNA hybridisation (DDH)
Rosselló-Móra and Amann (2001) FEMS Microbiol. Rev. 25:39-67 doi:10.1016/S0168-6445(00)00040-1
• “Gold Standard” for prokaryotic taxonomy, since 1960s. “70% identity ≈ same species.”
• Denature DNA from two organisms.
• Allow to anneal. Reassociation ≈ similarity, measured as ∆T of denaturation curves.
Proxy for sequence similarity - replace with genome analysis [39]?
[39] Chan et al. (2012) BMC Microbiol. 12:302 doi:10.1186/1471-2180-12-302
Average Nucleotide Identity (ANIb)
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. 57:81-91 doi:10.1099/ijs.0.64483-0
1. Break genomes into 1020 nt fragments
2. ANIb: Mean % identity of all BLASTN matches with > 30% identity and > 70% fragment coverage.
• DDH:ANIb linear
• DDH:%ID linear
• 70% ID (DDH) ≈ 95% ANIb
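The two ANIb steps can be sketched as follows, leaving out the BLASTN search itself. The hits are hypothetical (percentage identity, alignment length) pairs, one per fragment, standing in for parsed BLASTN output. The thresholds follow the values given above, and the function names are invented for this example.

```python
def fragment(seq, size=1020):
    """Step 1: break a genome sequence into consecutive fragments of `size` nt."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def anib(hits, fragment_size=1020, min_identity=30.0, min_coverage=0.70):
    """Step 2: mean % identity of hits passing the identity and coverage filters."""
    kept = [identity for identity, aligned_length in hits
            if identity > min_identity
            and aligned_length / fragment_size > min_coverage]
    if not kept:
        return float("nan")
    return sum(kept) / len(kept)

# Hypothetical best hits for four fragments: (% identity, aligned length in nt)
hits = [(98.0, 1020), (96.0, 900), (99.0, 300), (25.0, 1000)]
print(len(fragment("A" * 2500)), "fragments")
print(anib(hits))
```

In this toy input the third hit is dropped for low coverage and the fourth for low identity, so the estimate is the mean of the first two.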
Average Nucleotide Identity (ANIm)
Richter and Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA 106:19126-19131 doi:10.1073/pnas.0906412106
1. Align genomes (MUMmer)
2. ANIm: Mean % identity of all matches
• DDH:ANIm linear
• 70% ID (DDH) ≈ 95% ANIm
TETRA: tetranucleotide frequency-based classifier introduced in the same paper.
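The flavour of a tetranucleotide comparison can be sketched as a correlation between two 256-element frequency vectors. This is a simplified stand-in rather than the published TETRA method: it correlates raw single-strand frequencies, with no normalisation against expected frequencies, and the sequences are invented and far too short to be meaningful.

```python
from collections import Counter
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetramers

def tetra_frequencies(seq):
    """Vector of tetranucleotide frequencies (assumes at least one ACGT tetramer)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical AT-rich and GC-rich sequences
at_rich = "ATATATTTAAATATATTAAT"
gc_rich = "GCGCGGCCGCGCCGGCGCGG"

same = pearson(tetra_frequencies(at_rich), tetra_frequencies(at_rich))
different = pearson(tetra_frequencies(at_rich), tetra_frequencies(gc_rich))
print(round(same, 3), round(different, 3))
```

Because the score depends only on composition, it reflects bulk sequence properties rather than alignment, which is why it can disagree with ANIb and ANIm.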
ANI/TETRA comparison
All three methods applied to Anaplasma spp.
[Figures: clustered heatmaps of pairwise ANIb, ANIm and TETRA scores (colour key 0.9 to 0.98, with histogram of counts) for nine Anaplasma genomes: A. phagocytophilum (NC_021881, NC_021880, NC_021879, NC_007797), A. centrale (NC_013532) and A. marginale (NC_004842, NC_012026, NC_022760, NC_022784).]
ANIb discards information, relative to ANIm: less sensitive
ANIb/ANIm ≈ evolutionary history; TETRA ≈ bulk composition
ANI in practice
Practical applications40 (note: no gene content used)
34 Dickeya isolates:
species structure
180 E.coli isolates:
subtyping
Brunei20070942_contigs
Muenster20063091_contigs
Senftenberg20070885_contigs
Lys142_contigs
Lys175_contigs
Lys130_contigs
Lys170_contigs
Lys126_contigs
Lys167_contigs
Lys176_contigs
Lys169_contigs
Lys50_contigs
X5038_contigs
Lys131_contigs
Lys171_contigs
Lys111_contigs
Lys107_contigs
Lys114_contigs
Lys16_contigs
Lys22_contigs
Lys65_contigs
Lys56_contigs
Lys113_contigs
Lys109_contigs
Lys77_contigs
Lys102_contigs
Lys100_contigs
Lys92_contigs
Lys94_contigs
Lys80_contigs
Lys64_contigs
Lys82_contigs
AW3_contigs
X5008_contigs
AW4_contigs
AW1_contigs
Lys118_contigs
Lys138_contigs
Lys121_contigs
Lys122_contigs
Lys177_contigs
Lys155_contigs
Lys165_contigs
Lys163_contigs
Lys160_contigs
Lys161_contigs
Lys172_contigs
Lys144_contigs
Lys135_contigs
Lys146_contigs
Lys123_contigs
Lys124_contigs
Lys150_contigs
Lys140_contigs
Lys157_contigs
Lys173_contigs
Lys156_contigs
Lys158_contigs
Lys159_contigs
Lys162_contigs
Lys5_contigs
X5084_contigs
X5042_contigs
Lys110_contigs
Lys136_contigs
Lys54_contigs
Lys1_contigs
Lys6_contigs
Lys112_contigs
X5012_contigs
Lys30_contigs
Lys25_contigs
Lys43_contigs
Lys37_contigs
Lys40_contigs
Lys151_contigs
Lys31_contigs
Lys27_contigs
Lys42_contigs
Lys51_contigs
Lys33_contigs
Lys46_contigs
Lys38_contigs
Lys89_contigs
Lys23_contigs
Lys115_contigs
Lys108_contigs
Lys104_contigs
DSM10973_contigs
Lys125_contigs
Lys105_contigs
Lys17_contigs
Lys128_contigs
Lys66_contigs
Lys73_contigs
Lys15_contigs
Lys91_contigs
[Figure: ANIm heatmap of pairwise genome comparisons. Rows and columns are labelled by isolate contig set (Lys*, AW*, DSM*, X5002/X5034/X5035/X5088 and other numbered isolates, Brunei20070942, Muenster20063091, Senftenberg20070885). Colour key and histogram: ANIm value (0.90 to 0.98) against count. Group labels: A, B1, B2, C, D, E, F, U, X.]
[40] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
Collinearity and Synteny
Genome rearrangements occur, but sequence similarity and ordering can still be conserved.
• Two elements are collinear if they lie in the same linear sequence
• Two elements are syntenous (or syntenic) if:
  • (orig.) they lie on the same chromosome
  • (mod.) there is conservation of blocks of order within the same chromosome
Signs of evolutionary constraint, like sequence conservation or synteny, may indicate functional genome regions.
Pyrococcus spp. [a]
[a] Zivanovic et al. (2002) Nuc. Acids Res. 30:1902-1910 doi:10.1093/nar/30.9.1902
Comparison of Pyrococcus genomes (P. horikoshii, P. abyssi, P. furiosus) shows chromosome shuffling.
Transposition is a major cause of genomic disruption.
Vibrio mimicus [a]
[a] Hasan et al. (2010) Proc. Natl. Acad. Sci. USA 107:21134-21139 doi:10.1073/pnas.1013825107
Chromosome C-II carries genes associated with environmental
adaptation; C-I carries virulence genes.
C-II has undergone extensive rearrangement; C-I has not.
Suggests modularity of genome organisation, as a mechanism for
adaptation (HGT, two-speed genome).
Serratia symbiotica [a]
[a] Burke and Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002
S. symbiotica is a recently evolved symbiont of aphids.
Massive genomic decay is an adaptation to the new environment.
Multiple genome alignment is hard
Can we not just align all our genomes together?
No. Because it’s really, really hard.
Analogous to problems with multiple sequence alignment (three or more sequences).
• Computationally extremely expensive (O(L^n), L = length of sequence, n = number of sequences)
• NP-complete problem: no known efficient way to find an optimal solution
Heuristic (approximate) methods are used, most commonly:
• Progressive alignment
• Iterative alignment
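The O(L^n) cost can be made concrete with a little arithmetic. The sketch below assumes that exact alignment fills a full dynamic-programming lattice with one cell per combination of positions, so n sequences of length L need L^n cells. The 5 Mbp genome length is an illustrative figure, and `dp_cells` is a name invented here.

```python
def dp_cells(length: int, n_sequences: int) -> int:
    """Cells in a full dynamic-programming lattice for n sequences of equal length."""
    return length ** n_sequences


# Illustrative (assumed) bacterial genome length: 5 Mbp
GENOME_LENGTH = 5_000_000

for n in (2, 3, 4):
    # Each extra genome multiplies the lattice size by L again
    print(f"{n} genomes: {dp_cells(GENOME_LENGTH, n):.2e} cells")
```

Every additional genome multiplies the work by another factor of L, which is why practical tools fall back on heuristics.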
Mauve [a]
[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
Progressive alignment tool, with a GUI. Application to nine enterobacteria: rearrangement of homologous backbone.
Alternatives include MLAGAN [41] and MUMmer [42]
[41] Brudno et al. (2003) Genome Res. 13:721-731 doi:10.1101/gr.926603
[42] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
Mauve algorithm [a]
[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
1. Find local alignments (multi-MUMs)
2. Build guide tree from multi-MUMs
3. Select subset of multi-MUMs as anchors, and partition into Local Collinear Blocks (LCBs): consistently ordered subsets
4. Progressive alignment against guide tree
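The idea of "consistently ordered subsets" in step 3 can be illustrated with a deliberately simplified sketch for two genomes. This is not Mauve's implementation: it ignores strand (inversions) and any weighting or filtering of anchors, and assumes each anchor is a unique pair of positions. The function name and the example coordinates are invented for illustration.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (position in genome A, position in genome B)


def collinear_blocks(anchors: List[Anchor]) -> List[List[Anchor]]:
    """Split anchors into runs that are adjacent and same-ordered in both genomes."""
    if not anchors:
        return []
    by_a = sorted(anchors)  # anchor order along genome A
    # Rank of each anchor when sorted along genome B
    rank_b = {anchor: i for i, anchor in enumerate(sorted(by_a, key=lambda x: x[1]))}
    blocks = [[by_a[0]]]
    for prev, curr in zip(by_a, by_a[1:]):
        if rank_b[curr] == rank_b[prev] + 1:
            blocks[-1].append(curr)  # still collinear: extend the current block
        else:
            blocks.append([curr])    # order breaks: start a new block
    return blocks


# Hypothetical anchors: the middle pair has moved relative to genome B
example_anchors = [(10, 10), (20, 20), (30, 70), (40, 80), (50, 40), (60, 50)]
example_blocks = collinear_blocks(example_anchors)
```

Each returned block is a set of anchors whose relative order is the same in both genomes; the boundaries between blocks mark candidate rearrangement breakpoints.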
Reordering contigs [a]
[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
Mauve also enables draft genome reordering.
Once LCBs are identified, the Mauve Contig Mover can be applied to reorder contigs.
Example exercise in data/whole_genome_alignment
Chromosome painting [a]
[a] Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
“Chromosome painting” infers recombination-derived ‘chunks’.
A genome’s haplotype is reconstructed in terms of recombination events from a ‘donor’ to a ‘recipient’ genome.
Chromosome painting [a]
[a] Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
Recombination events are summarised in a coancestry matrix.
H. pylori: most recombination is within geographical bounds, but there is asymmetrical donation from Amerind/East Asian to European isolates.
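As a rough illustration of what a coancestry matrix summarises, the sketch below tallies already-inferred chunks by recipient and donor. Inferring the chunks themselves is the hard part and is not shown; the isolate labels and counts are invented, and `coancestry_matrix` is a hypothetical helper rather than part of any painting tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def coancestry_matrix(chunks: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count chunks for each (recipient, donor) pair."""
    matrix = defaultdict(lambda: defaultdict(int))
    for recipient, donor in chunks:
        matrix[recipient][donor] += 1
    return {recipient: dict(donors) for recipient, donors in matrix.items()}


# Hypothetical chunk assignments as (recipient, donor) pairs
example_chunks = [
    ("europe_1", "asia_1"),
    ("europe_1", "asia_1"),
    ("europe_1", "europe_2"),
    ("asia_1", "asia_2"),
]
example_matrix = coancestry_matrix(example_chunks)

# Asymmetry: chunks donated asia_1 -> europe_1 versus europe_1 -> asia_1
asia_to_europe = example_matrix["europe_1"].get("asia_1", 0)
europe_to_asia = example_matrix["asia_1"].get("europe_1", 0)
```

Because rows are recipients and columns are donors, the matrix need not be symmetric, which is how asymmetrical donation between populations becomes visible.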
P. aeruginosa nosocomial acquisition [a]
[a] Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278
Motivation
Nosocomial water transmission of P. aeruginosa is an urgent concern
Setup
Burns patients (30) screened for P. aeruginosa on admission
Samples taken from patients and environment
All P. aeruginosa isolates (141) whole-genome sequenced
Outcome
Clustering of isolates by room and outlet
Three patient isolates identical to water isolates from same room
Biofilm from thermostatic mixer valve a possible source
P. aeruginosa nosocomial acquisition [a]
[a] Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278
Methods
• Illumina MiSeq WGS of 141 isolates
• Metagenomic sequencing of biofilm
• Simulated sequencing of 55 published P. aeruginosa genomes
• BWA mapping against PAO1 reference genome
• SNPs called with SAMtools & VarScan
• ML tree reconstruction with FastTree
• De novo assembly with Velvet for MLST prediction
Sequences and bioinformatic methods shared online:
http://www.github.com/joshquick/snp_calling_scripts
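Once SNPs have been called against a common reference, isolates can be compared by counting the sites at which they differ. The sketch below is a generic illustration of that idea, not the scripts used in the study; the isolate names and sequences are invented, and ambiguous or missing bases are simply skipped.

```python
from itertools import combinations
from typing import Dict, Tuple


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both sequences have an unambiguous, different base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in "ACGT" and b in "ACGT"
    )


def pairwise_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """SNP distance for every pair of isolates, keyed by sorted name pair."""
    return {
        (p, q): snp_distance(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


# Hypothetical variant-site alignment (names and bases are made up)
example_alignment = {
    "patient_1": "ACGTACGT",
    "water_1": "ACGTACGT",
    "patient_2": "ACGTTCGA",
}
example_distances = pairwise_distances(example_alignment)
```

A distance of zero between a patient isolate and a water isolate is the kind of signal that "identical isolates from the same room" refers to, although real analyses also need to account for sequencing and calling error.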
P. aeruginosa nosocomial acquisition [a]
[a] Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278
Strengths
A P. aeruginosa source could be tracked by WGS
Insights into transmission: water to patient a likely route
Sensitivity: identifies microevolution
Limitations
Small sample size: 5/30 patients infected, giving 55/141 isolates
Not clear that causal inferences are general
300-day sampling, not real-time crisis analysis
Good existing reference genome set for this bacterium
Sequencing cost: ≈£8k; Staff cost: ≈£15k
Microbial Genomics and
Bioinformatics
BM405
4.Genome Features
Genome Features
• Genome features are annotated regions of the genome.
• Typically represent functional elements.
• May be simple (single region), or complex (subfeatures)
Why annotate genome features?
• Almost all use of genomics depends on annotation: annotation quality is critical to downstream use of genomics in biology
• Annotation is curation (a live, active process), not cataloguing
• Automated annotation from curated data (public databases) is the only game in town, given the data quantities we generate
• But you can’t propagate something that doesn’t exist: up to 30% of metabolic activity has no known gene associated with it [43]
• Biocurators can spend as much time “de-annotating” literature-based annotations as entering new data [44]
[43] Chen and Vitkup (2007) Trends Biotech. doi:10.1016/j.tibtech.2007.06.001
[44] Bairoch (2009) Nat. Preced. doi:10.1038/npre.2009.3092.1
Gene Features
Gene features have significant substructure, especially in
eukaryotes.
• 5' UTR
• translation start
• intron start/stop
• exon start/stop
• translation stop
• translation terminator
• 3' UTR
ncRNA Features
• tRNA - transfer RNA
• rRNA - ribosomal RNA
• CRISPRs - bacterial/archaeal defence (used for genome editing)
• many other classes
Regulatory/Repeat Features
Regulatory sites
• transcription start sites
• RNA polymerase binding sites
• Transcription Factor Binding Sites (TFBS)
Repetitive regions and mobile elements
• tandem repeats
• (retro-)transposable elements
• phage inclusions
Principles of feature prediction
Two main approaches to feature prediction:
• ab initio prediction - start from first principles, using only the genome sequence:
  • Unsupervised methods - not trained on a dataset
  • Supervised methods - trained on a dataset
• homology matches
  • alignment to features from related organisms (comparative genomics, annotation transfer)
  • from known gene products (e.g. proteins, ncRNA)
  • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq)
Dedicated tools are available for many different classes of feature.
Prokaryotic CDS Prediction Methods
Using CDS prediction as an illustrative example for all feature prediction.
Sequence conservation (evolutionary constraint; an unsupervised, a priori method) can be useful.
• Prokaryotes are “easier” than eukaryotes for gene/CDS prediction
  • Less uncertainty in predictions (isoforms, gene structure)
  • Very gene-dense (over 90% of chromosome is coding sequence)
  • No intron-exon structure
Prokaryotic CDS Prediction Methods
ORFs are plentiful:
• Problem is: “which possible ORF contains the true gene, and which start site is correct?”
• Still not a solved problem
Finding Open Reading Frames
The simplest approach: find ORFs (sequence between two consecutive in-frame stop codons)
• ORF finding is naive, and does not consider:
  • Start codon
  • Promoter/RBS motifs
  • Wider context (e.g. overlapping genes)
Dedicated tools, e.g. Glimmer, Prodigal, RAST, GeneMarkS, are usually better.
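The naive approach can be sketched in a few lines. This is a minimal illustration using the definition above (the stretch between two consecutive in-frame stop codons), assuming the standard stop codons TAA, TAG and TGA and an unambiguous ACGT sequence. Like the approach it illustrates, it ignores start codons, RBS motifs and context. Function names and the toy sequence are invented.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 1) -> List[Tuple[str, int, int, int]]:
    """Return (strand, frame, start, end) for each stop-to-stop ORF.

    Coordinates are 0-based, half-open, and relative to the strand scanned;
    start is the first base after one stop codon, end is the first base of the next.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # unknown until the first stop codon in this frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if orf_start is not None and (i - orf_start) // 3 >= min_codons:
                        orfs.append((strand, frame, orf_start, i))
                    orf_start = i + 3
    return orfs


# Toy sequence: TAA | ATG AAA CCC | TAG on the forward strand, frame 0
example_orfs = find_orfs("TAAATGAAACCCTAG")
```

On a real chromosome a `min_codons` threshold of a few tens of codons is usually applied, since short stop-to-stop stretches occur by chance in every frame.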
Two ab initio CDS Prediction Tools
• Glimmer [45]
  • Interpolated Markov models
  • Can be trained on “gold standard” datasets
• Prodigal [46]
  • Log-likelihood model based on GC frame plots, followed by dynamic programming
  • Can be trained on “gold standard” datasets
Applying these to an example bacterial chromosome. . .
[45] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009
[46] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119
Comparing predictions in Artemis [a]
[a] Carver et al. (2012) Bioinformatics 28:464-469 doi:10.1093/bioinformatics/btr703
Not every ORF (green) is predicted to contain a coding sequence (CDS; blue/orange).
Self-contradictory CDS calls (orange); even automated annotation needs manual curation.
Comparing predictions in Artemis
Glimmer (green) and Prodigal (blue) CDS predictions do not always agree (presence/absence, start position).
How do we know which (if either) is best?
Using a “Gold Standard”: validation [a]
[a] Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
A general approach for all predictive methods
• Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard”
• Evaluate your predictive method against that set for sensitivity, specificity, accuracy, precision, etc.
This ought to be done by the method developers, but it is often wise to evaluate in your own system.
Many methods are available; coverage is beyond the scope of this introduction.
Contingency Tables
                          Condition (Gold standard)
                          True              False
Test outcome   Positive   True Positive     False Positive
               Negative   False Negative    True Negative
Performance Metrics
Sensitivity = TPR = TP/(TP + FN)
Specificity = TNR = TN/(FP + TN)
FPR = 1 − Specificity = FP/(FP + TN)
If you don’t have this information, you can’t interpret
predictive results properly.
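The three formulas above translate directly into code. A minimal sketch, with invented counts and a hypothetical function name:

```python
from typing import Dict


def performance_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity and FPR from contingency-table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # TPR = TP / (TP + FN)
        "specificity": tn / (fp + tn),  # TNR = TN / (FP + TN)
        "fpr": fp / (fp + tn),          # FPR = 1 - specificity
    }


# Hypothetical counts: 100 true cases and 1000 true non-cases
example_metrics = performance_metrics(tp=95, fp=10, fn=5, tn=990)
```

Note that all three quantities are computed within a column of the contingency table, so on their own they say nothing about how common the condition is.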
“Gold Standard” results
• Tested Glimmer [47] and Prodigal [48] on two close enterobacterial relatives as “gold standards” (still not perfect. . .)
  1. Manually annotated (>3 expert person years)
  2. Community-annotated (many research groups, interested in their own subset of genes)
• Both methods trained directly on the annotated genes in each organism!
[47] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009
[48] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119
“Gold Standard” results
Manually annotated: 4550 CDS
genecaller          glimmer      prodigal
predicted           4752         4287
missed              284 (6%)     407 (9%)
Exact Prediction
  sensitivity       62%          71%
  FDR               41%          25%
  PPV               59%          75%
Correct ORF
  sensitivity       94%          91%
  FDR               10%          3%
  PPV               90%          97%
“Gold Standard” results
Community annotated: 4475 CDS
genecaller          glimmer      prodigal
predicted           4679         4467
missed              112 (3%)     156 (3%)
Exact Prediction
  sensitivity       62%          86%
  FDR               31%          14%
  PPV               69%          86%
Correct ORF
  sensitivity       97%          97%
  FDR               7%           3%
  PPV               93%          97%
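The “Correct ORF” rows can be recovered from the counts in the two tables. The sketch below assumes that true positives are the annotated CDS that were not missed (annotated minus missed), so that sensitivity = TP/annotated, PPV = TP/predicted and FDR = 1 − PPV. That reading is an assumption made here, not something the slides state, but under it the rounded values agree with the tables.

```python
from typing import Tuple


def correct_orf_metrics(annotated: int, predicted: int, missed: int) -> Tuple[int, int, int]:
    """Rounded (sensitivity %, FDR %, PPV %), assuming TP = annotated - missed."""
    tp = annotated - missed
    sensitivity = tp / annotated
    ppv = tp / predicted
    fdr = 1 - ppv
    return round(100 * sensitivity), round(100 * fdr), round(100 * ppv)


# Counts taken from the two results tables
manual = {
    "glimmer": correct_orf_metrics(annotated=4550, predicted=4752, missed=284),
    "prodigal": correct_orf_metrics(annotated=4550, predicted=4287, missed=407),
}
community = {
    "glimmer": correct_orf_metrics(annotated=4475, predicted=4679, missed=112),
    "prodigal": correct_orf_metrics(annotated=4475, predicted=4467, missed=156),
}
```

The much lower “Exact Prediction” figures cannot be derived this way, because they also require the predicted start site to match the annotation.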
Gene/CDS Prediction
• Alternative CDS (and all other) prediction methods are
unlikely to give identical results, or perform equally well
• There is No Free Lunch (this is a theorem: http://en.wikipedia.org/wiki/No_free_lunch_theorem)
• To assess/choose between methods, performance metrics are
required
• Even on prokaryotes (a relatively simple case), current best
methods for CDS prediction are imperfect
• Manual correction is often required (usually the most
demanding and time-consuming part of the process).
Prokaryotic Annotation Pipelines [a]
[a] Richardson and Watson (2012) Brief. Bioinf. 14:1-12 doi:10.1093/bib/bbs007
Many choices, including RAST [49], PROKKA [50], BASys [51], etc.
Often perform both CDS/feature calling and functional prediction.
Two broad approaches:
1. Heavyweight: maintain database and resource, often annotating by homology, e.g. RAST
2. Lightweight: chain together multiple third-party packages, e.g. PROKKA
Pipelines take a lot of tedium (and control) out of annotating bacterial genomes, but have the same issues as every other prediction tool.
[49] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75
[50] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153
[51] Van Domselaar et al. (2005) Nuc. Acids Res. 33:W455-W459 doi:10.1093/nar/gki593
PROKKA [a]
[a] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153
• Lightweight, and fast.
• Runs locally. (5Mbp genome takes ≈10min on my desktop; more detailed ncRNA prediction takes ≈20min)
• Flexible: built-in databases can be replaced by user databases.
• Uses freely-accessible third-party tools for prediction
Simple to run (at the command-line, or in Galaxy [52]).
[52] Goecks et al. (2010) Genome Biol. 11:R86 doi:10.1186/gb-2010-11-8-r86
RAST [a]
[a] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75
• Server-based (http://rast.nmpdr.org/). Queues likely.
• Relies on SEED and FIGFam databases, held at NMPDR
  • FIGFam: isofunctional homologue families
• Produces metabolic reconstruction
Principles of function prediction
At genome scale, we realistically have to automate function
prediction.
Function prediction is just like any other prediction method.
“Does this sequence imply that function?”
Two main approaches to function prediction:
• ab initio prediction (on basis of feature sequence/context only)
  • Unsupervised methods - not trained on an exemplar dataset
  • Supervised methods - trained on an exemplar dataset
• homology matches (sequence similarity)
  • alignment to features with known/predicted functions
Homology-based function prediction
Two proteins with similar sequence may have similar function.
But. . .
• How similar do they have to be (and where) to share the same
function?
• What do we mean by ‘same function’, anyway?
Interaction/substrate specificity? Participation in a pathway?
Contribution to a structure? Biochemical interconversion? . . .
• How confident can we be in the comparator (annotated)
sequence: was that function determined experimentally?
Gene Ontology (GO) [a]
[a] Ashburner et al. (2000) Nat. Genet. 25:25-29 doi:10.1038/75556
The Gene Ontology provides a common vocabulary for describing biological function, and unifying functional descriptions.
Ontologies (controlled vocabularies) are central to information-sharing.
Gene Ontology Consortium: http://geneontology.org/
Many annotation tools and databases produce GO output, or compatible controlled vocabulary terms, e.g.
• Blast2GO [53]: BLAST-based annotation
• PHI-Base [54]: microbial pathogen-host interaction specific functions
• GOPred [55]: combines several protein function classifiers
[53] Conesa et al. (2005) Bioinformatics 21:3674-3676 doi:10.1093/bioinformatics/bti610
[54] Winnenburg et al. (2006) Nuc. Acids Res. 34:D459-D464 doi:10.1093/nar/gkj047
[55] Sarac et al. (2010) PLoS One 5:e12382 doi:10.1371/journal.pone.0012382
Are database annotations reliable? [a]
[a] Schnoes et al. (2013) PLoS Comp. Biol. 9:e1003063 doi:10.1371/journal.pcbi.1003063
Are protein function annotations in databases determined experimentally, or by annotation transfer?
High throughput experiments and genome annotations are conducted without validation of function, and placed in databases.
• GO databases record annotation origin by publication
• GO databases record evidence codes, e.g.: EXP=Inferred from Experiment; ISS=Inferred from Sequence or Structural Similarity
• 0.14% of contributing publications provide 25% of all experimentally validated annotations in the UniProt-GOA compilation.
• There are biases in functional annotation.
No clear solution to this kind of bias - but we have to recognise and account for it.
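Evidence codes make it possible to check how much of an annotation set rests on experiment. The sketch below tallies codes from tab-separated lines laid out like a GO annotation (GAF) file, where the evidence code is the seventh column and comment lines start with "!". The set of codes treated as experimental here (EXP, IDA, IPI, IMP, IGI, IEP) is a choice made for this example and leaves out the high-throughput codes; the accessions, GO IDs and references in the example lines are invented.

```python
from collections import Counter
from typing import Iterable

# Codes treated as experimental in this sketch (an assumption; HTP codes are excluded)
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def tally_evidence(gaf_lines: Iterable[str]) -> Counter:
    """Count annotations as 'experimental' or 'other' by evidence code (column 7)."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not an annotation row
        kind = "experimental" if fields[6] in EXPERIMENTAL_CODES else "other"
        counts[kind] += 1
    return counts


# Invented rows: DB, object ID, symbol, qualifier, GO ID, reference, evidence code
example_lines = [
    "!gaf-version: 2.2",
    "\t".join(["UniProtKB", "X00001", "geneA", "", "GO:0000001", "PMID:0", "IDA"]),
    "\t".join(["UniProtKB", "X00002", "geneB", "", "GO:0000002", "PMID:0", "ISS"]),
    "\t".join(["UniProtKB", "X00003", "geneC", "", "GO:0000003", "PMID:0", "IEA"]),
]
example_counts = tally_evidence(example_lines)
```

Running a tally like this over a real annotation file gives a quick sense of how much of it is transferred rather than measured.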
Are database annotations reliable? [a]
[a] Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340
The Critical Assessment of Function Annotation (CAFA) project.
Do biased database annotations matter?
Experimental annotations of proteins are incomplete. But is that important?
Tested by simulation, and following databases for three years. [56]
1. Yes. It matters.
2. Current large scale annotations are meaningful and almost surprisingly reliable.
3. The nature and level of data incompleteness, and type of classification model have an effect.
4. “Low precision, high recall” (i.e. less discriminating) tools most significantly affected.
Molecular function prediction is usually more reliable than biological process prediction [57]
[56] Jiang et al. (2014) Bioinformatics 30:i609-i616 doi:10.1093/bioinformatics/btu472
[57] Cozzetto et al. (2013) BMC Bioinf. 14:S3-S1 doi:10.1186/1471-2105-14-S3-S1
CAFA resultsa
a
Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340
The Critical Assessment of Function Annotation (CAFA) 2013
results. (F-measure combines precision and recall)
• You can do better than
BLAST.
• Best-performing methods
do comparably well.
• Best methods used
evolutionary relationships,
structure, and expression
data.
• Machine Learning works
best.
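The F-measure mentioned above can be sketched as the harmonic mean of precision and recall. This is an illustration of the general formula only; the function name is hypothetical and the example values are not CAFA results.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A balanced method and a lopsided one with the same arithmetic mean
balanced = f_measure(0.5, 0.5)
lopsided = f_measure(0.9, 0.1)
print(balanced, lopsided)
```

Because it is a harmonic mean, the F-measure penalises methods that trade one of precision or recall heavily against the other.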
A wee trip to the doctor
• You go for a checkup, and are tested for disease X
• The test has sensitivity = 0.95 (predicts disease where there
is disease)
• The test has FPR = 0.01 (predicts disease where there is no
disease)
• Your test is positive
• What is the probability that you have disease X?
• 0.01, 0.05, 0.50, 0.95, 0.99?
• (Audience Participation!)
A wee trip to the doctor
• What is the probability that you have disease X?
• Unless you know the baseline occurrence of disease X,
you cannot determine this.
• Baseline occurrence: f_X
• f_X = 0.01 ⟹ P(disease|+ve) = 0.490 ≈ 0.5
• f_X = 0.8 ⟹ P(disease|+ve) = 0.997 ≈ 1.0
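The two values above can be checked directly. A minimal sketch, assuming only the sensitivity (0.95) and FPR (0.01) stated for the test; the function name `posterior` is ours, chosen for illustration.

```python
def posterior(sensitivity, fpr, base_rate):
    """P(disease | positive test) from sensitivity, FPR and baseline occurrence."""
    true_pos = sensitivity * base_rate        # diseased and testing positive
    false_pos = fpr * (1.0 - base_rate)       # healthy but testing positive
    return true_pos / (true_pos + false_pos)


for f_x in (0.01, 0.8):
    print(f"f_X = {f_x}: P(disease|+ve) = {posterior(0.95, 0.01, f_x):.3f}")
```

The same test gives very different answers for a rare and a common disease, which is why the question cannot be answered without the baseline occurrence.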
Why Performance Metrics Mattera
a
Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
• Imagine a paper describing a predictor for protein functional
class (e.g. Type III effector)
• The paper reports sensitivity = 0.95, FPR = 0.01
• You run the predictor on 4,500 proteins in a new genome
• It predicts 50 members of the class. How many of them are
likely to be true positives?
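• We need the baseline frequency of that class (f_X) in the genome to
determine this.
• We estimate ≈ 45 members in the 4,500-protein complement, so
f_X = 45/4500 = 0.01
• f_X = 0.01 ⟹ P(class|+ve) = 0.490 ≈ 0.5, so only about half of
the 50 predictions are likely to be true positives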
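The same result can be reached by counting expected true and false positives rather than applying the formula directly. A sketch under the slide's stated numbers (4,500 proteins, ≈ 45 true class members, sensitivity 0.95, FPR 0.01); the variable names are ours.

```python
n_proteins = 4500
n_class = 45                      # estimated true members of the class
sensitivity, fpr = 0.95, 0.01     # as reported for the predictor

# Expected numbers of correct and incorrect positive calls
expected_tp = sensitivity * n_class
expected_fp = fpr * (n_proteins - n_class)

# Fraction of positive calls expected to be correct
ppv = expected_tp / (expected_tp + expected_fp)

# Applied to the 50 predictions returned for the new genome
n_predicted = 50
likely_true = ppv * n_predicted
print(f"P(class|+ve) = {ppv:.3f}; about {likely_true:.0f} of {n_predicted} predictions")
```

Because non-members outnumber members roughly 99 to 1, even a 1% false positive rate produces about as many false calls as true ones.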
Bayes’ Theorem
• May seem counter-intuitive: 95% sensitivity, 99% specificity
⟹ ≈ 50% chance of any prediction being incorrect (at a base rate of 0.01)
• Probability given by Bayes’ Theorem
• $P(X|+) = \frac{P(+|X)\,P(X)}{P(+|X)\,P(X) + P(+|\bar{X})\,P(\bar{X})}$
Let’s play a game. . .
2, 4, 6, . . .
• This step commonly overlooked in the literature
• confirmation bias
• people want to see positive examples/tell a story
• people want to think their predictor works
A cautionary talea
a
Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376
• Paper describes EffectiveT3, a type III effector prediction
tool
• Reported sensitivity ≈ 0.71, FPR ≈ 0.15
• Applied tool to 739 complete bacterial and archaeal genomes
• Organisms with an identifiable T3SS: 2-7% of genome
predicted to be secreted
• Organisms without an identifiable T3SS (or known not
to have one): 1-10% of genome predicted to be secreted
• “The surprisingly high number of (false) positives in genomes
without T3SS exceeds the expected false positive rate”
• This is not a surprise, statistically.
A cautionary talea
a
Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376
Probability that an EffectiveT3 positive prediction corresponds
to a secreted protein is given by Bayes’ Theorem
• $P(X|+) = \frac{P(+|X)\,P(X)}{P(+|X)\,P(X) + P(+|\bar{X})\,P(\bar{X})}$
• $P(+|X)$ = sensitivity = 0.71
• $P(+|\bar{X})$ = FPR = 0.15
• $P(X)$ = base rate ≈ 0.03 (58)
• ⟹ $P(X|+)$ ≈ 0.13
Only about 13% of positive predictions are likely to be true positives!
How many predicted type III secreted proteins were there…
58
Boch and Bonas (2010) Annu. Rev. Phytopathol. 48:419-436 doi:10.1146/annurev-phyto-080508-081936
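The figure above depends strongly on the assumed base rate. A short sketch, using the reported sensitivity (0.71) and FPR (0.15); the base rates other than 0.03 are illustrative assumptions, not values from the paper.

```python
def ppv(sensitivity, fpr, base_rate):
    """P(X|+): probability that a positive prediction is a true positive."""
    true_pos = sensitivity * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


# 0.03 is the base rate used on the slide; the others are for comparison only
for base_rate in (0.01, 0.03, 0.05, 0.10):
    print(f"P(X) = {base_rate:.2f}: P(X|+) = {ppv(0.71, 0.15, base_rate):.2f}")
```

Even if effectors made up 10% of the genome, fewer than half of the positive calls would be expected to be correct at this FPR.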
Interpreting genome-scale predictionsa
a
Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
• Statistics at genome-scale can be counterintuitive.
• Use Bayes’ Theorem!
• Predictions identify groups, not individual members of the
group. e.g.
• Test for airport smugglers has P(smuggler|+) = 0.9
• Test gives 100 positives
• Which specific individuals are truly smugglers?
• The test does not allow you to determine this: you need more
evidence for each individual
• The same principle applies to other classifiers (including protein
functional class prediction): watch for ‘cherry-picking’ in
publications
Reconstructing metabolisma
a
Thiele and Palsson (2010) Nat. Protoc. 5:93-121 doi:10.1038/nprot.2009.203
Once metabolic functional annotation has been assigned to
features, we can do comparative analysis of metabolism.
Dynamic models of metabolisma
a
Orth et al. (2010) Nat. Biotech. 28:245-248 doi:10.1038/nbt.1614
By using constraint-based models (e.g. Flux Balance Analysis), we
can make these into dynamic representations of bacterial
metabolism.
• Upper, lower bounds to reaction rates
• Define objective phenotype
• Calculate conditions resulting in flux
• in silico knockouts
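The four steps above can be sketched as a small linear programme. This is an illustration only: the four-reaction network, its bounds and all names are hypothetical, and SciPy's general-purpose `linprog` solver is an assumption standing in for a dedicated constraint-based modelling package. The model assumes steady state (S·v = 0).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: uptake -> A; A -> B via r1 or r2; B -> biomass
reactions = ["uptake", "r1", "r2", "biomass"]

# Stoichiometric matrix S: rows are metabolites (A, B), columns are reactions
S = np.array([
    [1, -1, -1,  0],   # A: made by uptake, used by r1 and r2
    [0,  1,  1, -1],   # B: made by r1 and r2, used by biomass
])

# Lower and upper bounds on each reaction rate (arbitrary units)
default_bounds = {
    "uptake": (0, 10),
    "r1": (0, 1000),
    "r2": (0, 4),
    "biomass": (0, 1000),
}


def max_biomass(knockouts=()):
    """Maximise biomass flux at steady state, with optional in silico knockouts."""
    bounds = [(0, 0) if r in knockouts else default_bounds[r] for r in reactions]
    c = np.zeros(len(reactions))
    c[reactions.index("biomass")] = -1.0   # linprog minimises, so negate the objective
    result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    return -result.fun


print("wild type:      ", max_biomass())
print("r1 knockout:    ", max_biomass(["r1"]))
print("uptake knockout:", max_biomass(["uptake"]))
```

Knocking out r1 forces all flux through the lower-capacity r2, so the objective falls but growth is still possible; knocking out uptake abolishes it.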
E. coli metabolisma
a
Monk et al. (2013) Proc. Natl. Acad. Sci. USA 110:20338-20343 doi:10.1073/pnas.1307797110
E. coli has a very long history of metabolic reconstruction59
Recent modelling work predicts which nutrients support growth
59
Reed and Palsson (2003) J. Bact. 185:2692-2699 doi:10.1128/JB.185.9.2692-2699.2003
E. coli metabolisma
a
Baumler et al. (2011) BMC Syst. Biol. 5:182 doi:10.1186/1752-0509-5-182
Models are complex, and experimental validation is essential
There’s more we don’t know. . .
Microbial Genomics and
Bioinformatics
BM405
5.Finding Equivalent Features
What makes genome features equivalent?
When we compare two features (e.g. genes) between two or more
genomes, there must be some basis for making the comparison.
That is, they have to be equivalent in some way, such as:
• common evolutionary origin
• functional similarity
• a family-based relationship
It’s common to define equivalence of genome features in terms of
evolutionary relationship.
Why look at equivalent features?
The real power of genomics is comparative genomics!
• Makes catalogues of genome components comparable between
organisms
• Differences, e.g. presence/absence of equivalents may support
hypotheses for functional or phenotypic difference
• Can identify characteristic signals for diagnosis/epidemiology
• Can build parts lists and wiring diagrams for systems and
synthetic biology
Evolutionary relationshipsa
a
Fitch (1970) Syst. Zool. 19:99-113 doi:10.2307/2412448
Equivalencies and relationships can be quite complex.
We need precise terms to describe relationships between genome
features.
• analogy: functional similarity
• homology: evolutionary common ancestor
Who let the -logues out?a
a
Fitch (2000) Trends Genet. 16:227-231 doi:10.1016/S0168-9525(00)02005-9
• homologues: elements that are similar because they share a
common ancestor. There are NOT degrees of homology
• analogues: elements that are (functionally?) similar, and this
may be through common ancestry or some other means, e.g.
convergent evolution
• orthologues: homologues that diverged through speciation
• paralogues: homologues that diverged through duplication
within the same genome
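The orthologue/paralogue distinction can be sketched on a gene tree whose internal nodes are already labelled as speciation or duplication events: a pair of genes is classified by the event at their most recent common ancestor. The tree, gene names and tuple representation below are hypothetical, and real trees need reconciliation against a species tree before any such labels exist.

```python
# Hypothetical labelled gene tree: internal nodes are (event, left, right),
# leaves are gene names of the form <gene>_<species>
tree = (
    "duplication",
    ("speciation", "geneA_sp1", "geneA_sp2"),
    ("speciation", "geneB_sp1", "geneB_sp2"),
)


def leaves(node):
    """Return all leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _event, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its most recent common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologues" if event == "speciation" else "paralogues"
    # Genes on opposite sides of this node have this node as their common ancestor
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


pairs = classify_pairs(tree)
for pair, label in sorted((sorted(p), l) for p, l in pairs.items()):
    print(pair, label)
```

Note that geneA_sp1 and geneB_sp2 come out as paralogues even though they sit in different genomes, because the event separating them was the duplication.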
ITYFIALMCTTa
a
Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030
But it’s a little more complicated than that.
Biology is not well-behaved.
• Gene loss
• Homologues may diverge so widely that they can be hard to
recognise
• Reconstructed evolutionary trees may not be robust inferences
of speciation (or relevant to it, in prokaryotes)
• There is no record of history - we can only make inferences
All classifications of orthology/paralogy are inferences!
Ensembl Comparaa
a
Vilella et al. (2009) Genome Res. 19:327-335 doi:10.1101/gr.073585.107
Some tools/databases, e.g. Ensembl Compara, use slightly
different definitions (almost everything’s an “orthologue”)
Why focus on orthologues?
Formalise the idea of corresponding genes in different organisms.
Orthologues serve two purposes:
• Evolutionary equivalence
• Functional equivalence (“The Ortholog Conjecture”60)
Applications in comparative genomics, functional genomics and
phylogenetics.61
Over 30 databases attempt to describe orthologous relationships
(http://questfororthologs.org/orthology_databases)62
60
Chen and Zhang (2012) PLoS Comp. Biol. 8:e1002784 doi:10.1371/journal.pcbi.1002784
61
Dessimoz (2011) Brief. Bioinf. 12:375-376 doi:10.1093/bib/bbr057
62
Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
Finding orthologues
Multiple methods and databases63,64,65
• Pairwise genome
• RBBH (aka BBH, RBH),
RSD, InParanoid, RoundUp
• Multi-genome
• Graph-based: COG, eggNOG,
OrthoDB, OrthoMCL, OMA,
MultiParanoid
• Tree-based: TreeFam,
Ensembl Compara,
PhylomeDB, LOFT
63
Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030
64
Trachana et al. (2011) Bioessays 33:769-780 doi:10.1002/bies.201100062
65
Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
Which prediction methods work best?
Taking advantage of prokaryotic operon structure: if the outer
pair of a syntenic triplet of genes are orthologous, the middle
gene is also likely to be orthologous.66
Specifically testing reciprocal best hits (RBH).
66
Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100
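The syntenic-triplet idea can be sketched as follows. This is a simplified illustration, not the procedure used in the cited paper: gene orders are plain lists (strand and circular chromosomes are ignored), every RBH partner is assumed to appear in the second gene order, and all gene names are made up.

```python
def syntenic_triplet_support(order_a, order_b, rbh):
    """Count syntenic triplets and how many of them have an RBH middle gene.

    order_a, order_b: gene IDs in chromosomal order for genomes A and B
    rbh: dict mapping a gene in A to its reciprocal best hit in B
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    triplets = middle_rbh = 0
    for left, mid, right in zip(order_a, order_a[1:], order_a[2:]):
        if left not in rbh or right not in rbh:
            continue
        i, j = pos_b[rbh[left]], pos_b[rbh[right]]
        if abs(i - j) != 2:           # outer partners must flank exactly one gene
            continue
        triplets += 1
        middle_b = order_b[(i + j) // 2]
        if rbh.get(mid) == middle_b:  # is the middle pair also an RBH?
            middle_rbh += 1
    return triplets, middle_rbh


# Hypothetical example: a4 has no RBH partner, so the second triplet's middle fails
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b9", "b5"]
rbh = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5"}

n_triplets, n_middle_rbh = syntenic_triplet_support(order_a, order_b, rbh)
print(n_triplets, n_middle_rbh)
```

The fraction of triplets whose middle gene is also an RBH gives an annotation-independent estimate of how often RBH agrees with synteny.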
Which prediction methods work best?
• Tested on 573 prokaryotic genomes
• 88-99% of RBH found in syntenic triplets
• Overwhelming majority of middle genes are RBH
RBH reliably finds orthologues.67
67
Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100
Which prediction methods work best?
Four methods tested against 2,723 curated orthologues from six
Saccharomycetes
• RBBH (and cRBH); RSD (and cRSD); MultiParanoid;
OrthoMCL
• Rated by statistical performance metrics: sensitivity,
specificity, accuracy, FDR
cRBH most accurate and specific, with lowest FDR.68
68
Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
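The four metrics named above all derive from the same confusion matrix. A minimal sketch with made-up counts (not values from the cited benchmark); the function name is ours.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                 # true positives found
        "specificity": tn / (tn + fp),                 # true negatives found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # all correct calls
        "fdr": fp / (tp + fp),                         # positive calls that are wrong
    }


# Hypothetical counts for an orthologue predictor scored against a curated set
metrics = classifier_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

FDR is the complement of P(X|+) from the Bayes' Theorem slides, so it is the metric that speaks most directly to how far an individual prediction can be trusted.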
Which prediction methods work best?
Testing on literature-based benchmarks for grouping by function
and correct branching of phylogeny.69
69
Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
Which prediction methods work best?
• Performance varies by choice of method, and interpretation of
“orthology”
• Biggest influence is genome annotation quality
• Relative performance varies with choice of benchmark
• (clustering) RBH outperforms more complex algorithms
under many circumstances
What is this magic RBH method?
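In outline, two genes are reciprocal best hits if each is the other's top-scoring match when the two genomes are searched against each other. A minimal sketch, assuming hit tables of (query, subject, bitscore) tuples such as might be parsed from tabular BLAST output; the hits and function names below are made up, and no coverage or identity filters are applied.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, bitscore). Returns query -> best subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hit tables for genome A searched against B, and B against A
a_vs_b = [("a1", "b1", 250), ("a1", "b2", 90), ("a2", "b2", 180), ("a3", "b2", 200)]
b_vs_a = [("b1", "a1", 245), ("b2", "a3", 198), ("b2", "a2", 175)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

Here a2's best hit is b2, but b2's best hit is a3, so a2 is left without an RBH partner; this is how RBH avoids pairing paralogues at the cost of missing some relationships.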
Functional adaptation in Pbaa
a
Toth et al. (2006) Ann. Rev. Phytopath. 44:305-336 doi:10.1146/annurev.phyto.44.070505.143444
Core genome
Once equivalent genes have been identified, those present in all
related isolates can be identified: the core genome.
The core genome is expected to underpin common function.
A core RBH cluster (clique) for 29 genomes:
Accessory genome
The remaining genes are the accessory genome, and are
expected to mediate function that distinguishes between isolates.
An accessory RBH cluster for 29 genomes:
Accessory clusters
Accessory RBH clusters can be pruned, to identify the accessory
genome specific to subgroups of isolates:
These genes may be responsible for subgroup-specific phenotypes
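The core/accessory split, and the pruning to subgroup-specific clusters, can be sketched with set operations. This assumes each cluster of equivalent genes has already been reduced to the set of genomes in which it occurs; the four genomes and five families below are hypothetical.

```python
# Hypothetical isolates and gene-family clusters (family -> genomes containing it)
genomes = {"G1", "G2", "G3", "G4"}
clusters = {
    "fam1": {"G1", "G2", "G3", "G4"},
    "fam2": {"G1", "G2"},
    "fam3": {"G3"},
    "fam4": {"G1", "G2", "G3", "G4"},
    "fam5": {"G1", "G2", "G4"},
}

# Core genome: families present in every isolate
core = sorted(f for f, members in clusters.items() if members == genomes)

# Accessory genome: everything else
accessory = sorted(f for f in clusters if f not in core)


def subgroup_specific(clusters, subgroup):
    """Families found in all members of a subgroup and in no other isolate."""
    return sorted(f for f, members in clusters.items() if members == set(subgroup))


print("core:", core)
print("accessory:", accessory)
print("specific to G1+G2:", subgroup_specific(clusters, {"G1", "G2"}))
```

With draft genomes, a strict "present in all" rule can misplace core genes that are simply missing from an incomplete assembly, so a relaxed threshold is a common design choice.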
Accessory genome
Accessory genomes act as a cradle for adaptive evolution70
This is particularly so for pathogens, such as Pseudomonas spp.71
70
Croll and McDonald (2012) PLoS Path. 8:e1002608 doi:10.1371/journal.ppat.1002608
71
Baltrus et al. (2011) PLoS Path. 7:e1002132 doi:10.1371/journal.ppat.1002132.t002
Core genome synteny
Using tools like i-ADHoRe72 that identify synteny and collinearity,
the structural organisation of the core genome can be determined:
For Dickeya, the core genome appears to be structurally
well-conserved across all isolates.
72
Proost et al. (2012) Nuc. Acids Res. 40:e11 doi:10.1093/nar/gkr955
Panseqa
a
Laing et al. (2010) BMC Bioinf. 11:461 doi:10.1186/1471-2105-11-461
Panseq is an online tool for identification of core and accessory
genomes, available at https://lfz.corefacility.ca/panseq/, and
https://github.com/chadlaing/Panseq for standalone use
Harvesta
a
Treangen et al. (2014) Genome Biol. 15:524 doi:10.1186/s13059-014-0524-x
Visualising and organising comparison/pangenome data across
thousands of bacteria is difficult.
The Harvest suite of tools enables alignment and visualisation of
thousands of genomes:
Things I didn’t get to
Conclusions
Licence: CC-BY-SA
By: Leighton Pritchard
This presentation is licensed under the Creative Commons
Attribution ShareAlike license
https://creativecommons.org/licenses/by-sa/4.0/

Microbial Genomics and Bioinformatics: BM405 (2015)

  • 1. Microbial Genomics and Bioinformatics BM405 1. Introduction Leighton Pritchard [1,2,3] [1] Information and Computational Sciences, [2] Centre for Human and Animal Pathogens in the Environment, [3] Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
  • 2. Acceptable Use Policy Recording of this talk, taking photos, discussing the content using email, Twitter, blogs, etc. is permitted (and encouraged), providing distraction to others is minimised. These slides will be made available on SlideShare. These slides, and supporting material including exercises, are available at https://github.com/widdowquinn/Teaching-Strathclyde-BM405
  • 3. Table of Contents Introduction A personal view Erwinia carotovora subsp. atroseptica Dickeya spp., Campylobacter spp., and Escherichia coli So what’s changed? High Throughput Sequencing Three revolutions, four dominant technologies Benchmarking Nanopore How fast is sequence data increasing? Sequence Data Formats FASTQ SAM/BAM/CRAM Repositories Assembly Overlap-Layout-Consensus de Bruijn graph assembly Read Mapping Short-Read Sequence Alignment The Assembly What you get back Comparative Genomics Computational Comparative Genomics Bulk Genome Properties Nucleotide Frequency/Genome Size Whole Genome Alignment An Introduction to Pairwise Genome Alignment Average Nucleotide Identity Whole Genome Alignment in Practice Ordering Draft Genomes By Alignment Chromosome painting Nosocomial P.aeruginosa acquisition Genome Features What are genome features? Prokaryotic CDS Prediction Assessing Prediction Methods Prokaryotic Annotation Pipelines Genome-Scale Functional Annotation Functional Annotation A visit to the doctor Statistics of genome-scale prediction Building to Metabolism Reconstructing metabolism Equivalent Genome Features What makes genome features equivalent? Homology, Orthology, Paralogy Who let the -logues out? What’s so important about orthologues? Evaluating orthologue prediction Using orthologue predictions Core and Pan-genomes Conclusions Things I Didn’t Get To Conclusions
  • 4. The impact [a] Genome sequencing and bioinformatics have transformed our understanding of prokaryotic biology: • function • evolution • interactions • community structure • real-time monitoring and diagnostics • as a platform for synthetic biology It now takes much longer to analyse data than to generate it. [a] Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
  • 5. The endpoints • 2003: Erwinia carotovora subsp. atroseptica • 2015: Dickeya spp., Campylobacter spp., and Escherichia coli
  • 7. 2003: E. carotovora subsp. atroseptica • £250k collaboration between SCRI, University of Cambridge, WT Sanger Institute • Single isolate: E. carotovora subsp. atroseptica SCRI1043 • The first sequenced enterobacterial plant pathogen (32 authors!) [1] • All repeats and gaps bridged and sequenced directly • Result: a single, complete, high-quality 5Mbp circular chromosome at 10.2X coverage: 106,500 reads [1] Bell et al. (2004) Proc. Natl. Acad. Sci. USA 101(30):11105-11110. doi:10.1073/pnas.0402424101
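The coverage figure on this slide can be related to read count and genome size with the usual back-of-envelope relation C = N × L / G (coverage = number of reads × mean read length / genome size). The sketch below is illustrative only: it uses the rounded values quoted on the slide (5 Mbp, 10.2X, 106,500 reads), and the function names are hypothetical.

```python
def coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Expected fold coverage: C = N * L / G."""
    return n_reads * mean_read_length_bp / genome_size_bp


def implied_read_length(fold_coverage, genome_size_bp, n_reads):
    """Mean read length implied by a stated coverage: L = C * G / N."""
    return fold_coverage * genome_size_bp / n_reads


# Rounded values from the slide; the true genome length is not exactly 5 Mbp.
GENOME_SIZE_BP = 5_000_000
N_READS = 106_500
FOLD_COVERAGE = 10.2

mean_length = implied_read_length(FOLD_COVERAGE, GENOME_SIZE_BP, N_READS)
print(f"Implied mean read length: {mean_length:.0f} bp")
```

Under these rounded inputs the implied mean read length is a little under 500 bp.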
  • 8. 2003: E. carotovora subsp. atroseptica A genome sequence is a starting point. . . • Manual annotation by the Sanger Pathogen Sequencing Unit • Literature searches and comparisons • Six people, for six months ≈ three person-years • Genes: BLAST, GLIMMER, ORPHEUS • Functional domains: PFAM, SIGNALP, TMHMM • Metabolism: KEGG • ncRNA: RFAM
  • 9. 2003: E. carotovora subsp. atroseptica Working (Eca_Sanger_annotation.gbk) and published (NC_004547.gbk) annotation files are in the data directory
  • 10. 2003: E. carotovora subsp. atroseptica Compared against all 142 available bacterial genomes [2] [2] data/Pba directory in the accompanying GitHub repository
  • 12. 2013: Dickeya spp. Sequenced and annotated 25 new isolates of Dickeya • 25 Dickeya isolates, at least six species • Multiple sequencing methods: 454, Illumina (SE, PE) • Minor publications (6, 8 authors) [3,4] • Results: 12-237 fragments containing 4.2-5.1Mbp, at 6-84X coverage, 170k-4m reads • Automated annotation: RAST with manual corrections [3] Pritchard et al. (2013) Genome Announc. 1 (4) doi:10.1128/genomeA.00087-12 [4] Pritchard et al. (2013) Genome Announc. 1 (6) doi:10.1128/genomeA.00978-13
  • 13. 2013: Dickeya spp. Within-genus comparisons: large-scale synteny and rearrangement Within-species comparisons: e.g. indels, HGT
  • 14. 2013: Dickeya spp. Within-genus comparisons: whole genome-based species delineation [5] [5] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
  • 15. 2013: Dickeya spp. Within-genus comparisons: differences in metabolism
  • 16. 2014: E. coli Sequenced and annotated ≈ 190 isolates of E. coli All bacteria environmental, sampled from lysimeters • Illumina paired-end sequencing. Total cost of sequencing 190 bacteria: ≈£11k • Automated annotation: PROKKA
  • 17. 2014: E. coli Sequencing output is variable, even with the same preps, the “same” bacteria, and similar sources. • Results: 5-3000 contigs (median ≈ 125); 9kbp-7.1Mbp (median ≈ 5Mbp); 170k-4m reads
  • 18. 2014: E. coli Genome sequencing enables within-species classification [Figure: clustered ANIm heatmap of the isolate assemblies; rows and columns are labelled with assembly names (e.g. Lys142_contigs, AW3_contigs, DSM10973_contigs, Brunei20070942_contigs); colour key and histogram of ANIm values from 0.9 to 0.98; group legend: A, B1, B2, C, D, E, F, U, X]
  • 19. 2014: Campylobacter spp. Sequenced ≈ 1034 isolates of Campylobacter Clinical, animal, food-associated isolates • Illumina paired-end sequencing. Total cost of sequencing >1000 bacteria: ≈£60k • Automated annotation: PRODIGAL
  • 20. 2014: Campylobacter spp. • Identified 15554 gene families from gene calls. • The calculation took 23 days on the institute cluster (4e12 pairwise protein comparisons!).
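To get a feel for the scale quoted on this slide, the arithmetic can be run backwards. The sketch below assumes the 4e12 figure counts all-against-all comparisons (every predicted protein used as a query against every predicted protein); the slides do not say how the comparisons were counted, so the implied protein numbers are an illustration rather than a reported result.

```python
import math

# Figures quoted on the slides.
N_ISOLATES = 1034
N_COMPARISONS = 4e12
RUN_DAYS = 23

# Assumption: all-against-all, so comparisons = n_proteins ** 2.
n_proteins = math.sqrt(N_COMPARISONS)
proteins_per_isolate = n_proteins / N_ISOLATES

# Sustained throughput needed to finish in the stated time.
comparisons_per_second = N_COMPARISONS / (RUN_DAYS * 24 * 3600)

print(f"Implied proteins in total: {n_proteins:.2e}")
print(f"Implied proteins per isolate: {proteins_per_isolate:.0f}")
print(f"Comparisons per second: {comparisons_per_second:.2e}")
```

The quadratic growth is the point: doubling the number of isolates roughly quadruples the number of comparisons.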
  • 22. So what’s changed? • Cost: £250k per genome, to £60 per genome. Now cheaper to sequence a genome than to analyse it! • Location: sequencing centre, to benchtop • Data: volume has increased massively - what you get back from machines, and what’s out there to work with More data is better, but also more challenging. • Speed: typical sequencing run time can be less than a day • Software: more software to do more things (but not always better. . .) • New kinds of experiment: genomes, exomes, variant calling, methylated sequences, . . . • New kinds of application: diagnostics, epidemic tracking, metagenomics, . . .
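The per-genome cost quoted above can be checked against the project totals given on the earlier slides. This is a rough sketch using the approximate figures from the slides (≈£11k for 190 E. coli isolates, ≈£60k for ≈1034 Campylobacter isolates); the dictionary and variable names are illustrative.

```python
# (total sequencing cost in GBP, number of genomes), as quoted on the slides.
projects = {
    "2003 E. carotovora subsp. atroseptica": (250_000, 1),
    "2014 E. coli": (11_000, 190),
    "2014 Campylobacter spp.": (60_000, 1034),
}

cost_per_genome = {
    name: total / n_genomes for name, (total, n_genomes) in projects.items()
}

for name, cost in cost_per_genome.items():
    print(f"{name}: £{cost:,.0f} per genome")

# Fold reduction in per-genome cost between 2003 and the 2014 E. coli project.
fold_reduction = (
    cost_per_genome["2003 E. carotovora subsp. atroseptica"]
    / cost_per_genome["2014 E. coli"]
)
print(f"Fold reduction: {fold_reduction:,.0f}x")
```

Both 2014 projects come out a little under £60 per genome, consistent with the figure on the slide.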
  • 23. So what’s changed? Having a single genome is useful, but having thousands really helps comparative genomics: combining genomic data, evolutionary and comparative biology • Transfer functional understanding of model systems (e.g. E. coli) to non-model organisms • Genomic differences may underpin phenotypic (host range, virulence, physiological) differences • Genome comparisons aid identification of functional elements on the genome • Studying genomic changes reveals evolutionary processes and constraints
  • 25. Revolutions One and Two [a] Revolution One: whole-genome shotgun • First bacterial genomes: Haemophilus influenzae (1995); E. coli, Bacillus subtilis (1997) • (Oh, and the human genome) Revolution Two: high-throughput sequencing • “Next-generation” sequencing (now “last-generation”). • 454 GS20 (2005), Illumina GAII (2007). • metagenomics; surveillance sequencing; SNP-based comparisons; transposon-sequencing for functional genomics; ChIP-seq; . . . [a] Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
  • 26. Not all HT sequencing is the same It’s all about the biology, but it all starts with the data. Sequencing technology (including library prep.) affects your sequence data. • Roche/454 • Illumina • Ion Torrent • Pacific Biosciences (PacBio)
  • 27. The basic principle DNA source is fragmented, and the fragments are sequenced.
  • 28. HTS: PE vs SE In high-throughput sequencing (e.g. Illumina), reads may be single-end or paired-end. Putting the jigsaw back together is sequence assembly.
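The fragment-then-sequence principle, and the difference between single-end and paired-end reads, can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real platform: the random "genome", fragment length, read length and function names are all hypothetical, and no sequencing errors are simulated.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(genome, n_fragments, fragment_len, read_len, seed=0):
    """Sample random fragments and read inwards from both ends of each.

    A single-end protocol would keep only the first read of each pair.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_fragments):
        start = rng.randrange(len(genome) - fragment_len + 1)
        fragment = genome[start:start + fragment_len]
        forward = fragment[:read_len]
        # The second read comes from the opposite strand at the far end.
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((forward, reverse))
    return pairs


# Toy 1 kbp random genome (hypothetical data).
genome_rng = random.Random(42)
genome = "".join(genome_rng.choice("ACGT") for _ in range(1000))

pairs = paired_end_reads(genome, n_fragments=5, fragment_len=300, read_len=100)
for forward, reverse in pairs:
    print(forward[:20], "...", reverse[:20])
```

Because both reads of a pair come from one fragment of roughly known length, their expected separation and orientation give an assembler extra information that single-end reads lack.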
  • 29. Four different chemistries [a] Reads differ by technology, and may require different bioinformatic treatment. . . • Roche/454: Pyrosequencing (long reads, but expensive, and high homopolymer error rate) (700-800bp, 0.7Gbp, 23h) • Illumina: Reversible terminator (cost-effective, massive throughput, but short read lengths) (2x150bp, 1.5Gbp, 27h) • Ion Torrent: Proton detection (short run times, good throughput, high homopolymer error rate) (200bp, 1Gbp, 3h) • PacBio: Real-time sequencing (very long reads, high error rate, expensive) (3-15kbp, 3Gbp/day, 20min) . . . different error profiles, varying capability to assemble/determine variation [a] Loman et al. (2012) Nat. Rev. Micro. 31:294-296 doi:10.1038/nbt.2522
  • 30. Costs of sequencing [a] [a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
  • 32. Benchmarked performance Apply several sequencing technologies to the same sample(s). Benchmark comparisons inform appropriate choice of sequencing technology [6-12] Progress in technologies is driving research very rapidly. Always look for most recent/relevant benchmarks. Bioinformatic methods also need to be benchmarked. [6] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699 [7] Salipante et al. (2014) Appl. Environ. Micro. 80:7583-7591 doi:10.1128/AEM.02206-14 [8] Frey et al. (2014) BMC Genomics 15:96 doi:10.1186/1471-2164-15-96 [9] Koshimizu et al. (2013) PLoS One 8:e74167 doi:10.1371/journal.pone.0074167 [10] Quail et al. (2012) BMC Genomics 13:341 doi:10.1186/1471-2164-13-341 [11] Loman et al. (2012) Nat. Biotech. 30:434-439 doi:10.1038/nbt.2198 [12] Lam et al. (2011) Nat. Biotech. 1 (6) doi:10.1038/nbt.2065
  • 33. Benchmarking on Vibrio [a] • Sequenced Vibrio parahaemolyticus (2x chromosomes, closed reference genome) with four technologies • Chose an assembler for each tech, and assembled reads • Excess reads with Ion/MiSeq: used random subsets of reads to determine required coverage • Aligned assemblies (MUMmer) to known high-quality chromosome sequence, to measure error [a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
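The random-subset step described above can be sketched as follows. This is not the procedure used by Miyamoto et al., only a minimal illustration of downsampling a read set to a target fold coverage; the function name and the toy data are hypothetical, and reads are assumed to be of similar length so that a fixed fraction of reads gives roughly the same fraction of bases.

```python
import random


def subsample_reads(reads, genome_size_bp, target_coverage, seed=0):
    """Randomly keep enough reads to reach roughly the target fold coverage."""
    total_bases = sum(len(read) for read in reads)
    current_coverage = total_bases / genome_size_bp
    if target_coverage >= current_coverage:
        # Nothing to remove: the data set is already at or below the target.
        return list(reads)
    n_keep = round(len(reads) * target_coverage / current_coverage)
    return random.Random(seed).sample(reads, n_keep)


# Toy data: 10,000 reads of 150 bp over a 10 kbp genome gives 150x coverage.
toy_reads = ["A" * 150] * 10_000
TOY_GENOME_SIZE_BP = 10_000

subset = subsample_reads(toy_reads, TOY_GENOME_SIZE_BP, target_coverage=60)
subset_coverage = sum(len(read) for read in subset) / TOY_GENOME_SIZE_BP
print(len(subset), "reads kept at", subset_coverage, "x coverage")
```

Repeating the assembly on subsets drawn at several target coverages, ideally with more than one seed per target, shows where assembly quality stops improving.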
  • 34. Benchmarking on Vibrio [a] [a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
  • 35. Benchmarking on Vibrio [a] De novo assembly and alignment against Vibrio parahaemolyticus (2x chromosomes) [a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
  • 36. Benchmarking on Vibrio [a] • More and longer reads do not always give the best assemblies: read depth, read distribution, and error rate also matter • Optimal assemblies were obtained at around 60x-80x coverage, for Illumina and Ion. • Multiple rRNA regions are fragmented in short-read assemblies • PacBio generated single chromosome contigs • Assembly of multiple-chromosome bacteria is currently feasible Published genomes are variable because methods are not standard (e.g. sequencing technology, assembler, parameter settings and pre-processing). . . [a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
  • 38. Revolution Three. Source: Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565. Revolution Three: single-molecule long-read sequencing • We are living through this revolution now • PacBio (SMRT): large machine, expensive • Nanopore: portable device, inexpensive • Less mature, less accurate, improving rapidly
  • 39. The future dominant sequencer? Oxford Nanopore. A sequencer the size of your hand. • Microfluidics, single-molecule sequencing; 11-70kbp reads • Reports current across pore (tiny electron microscope) as molecule moves through • $10/Mbp, 110Mbp per flowcell [13]. [13] Yaniv Erlich (2013) Future Continuous blog
  • 40. Early data. Source: Quick et al. (2014) GigaScience 3:22 doi:10.1111/1755-0998.12324. It’s a fast-moving area, and results are improving.
  • 41. Developing tools. Oxford Nanopore’s open beta went out without analysis tools. Tools (Poretools, poRe, etc.) were written, tested and validated by the user community [14,15]. [14] Loman and Quinlan (2014) Bioinformatics doi:10.1093/bioinformatics/btu555 [15] Watson et al. (2014) Bioinformatics doi:10.1093/bioinformatics/btu590
  • 42. Recent applications • Amplicon sequencing (16S metagenomics) of bacteria and viruses [16] • Real-time viral diagnostics [17] • Scaffolding of a bacterial genome [18] • Complete de novo assembly of a bacterial genome [19]. [16] Kilianski et al. (2015) GigaScience doi:10.1186/s13742-015-0051-z [17] Greninger et al. (2015) Genome Med. doi:10.1186/s13073-015-0220-9 [18] Karlsson et al. (2015) Sci. Reports doi:10.1038/srep11996 [19] Loman et al. (2015) Nat. Meth. doi:10.1038/nmeth.3444
  • 43. The three revolutions. Source: Loman and Pallen (2015) Nat. Rev. Micro. doi:10.1038/nrmicro3565
  • 45. Predicting the future is hard... “How many genomes will we have, and when?” Su et al. attempted to answer this [20]. [20] http://sulab.org/2013/06/sequenced-genomes-per-year/
  • 46. After that, the flood... High-throughput sequencing methods have completely changed the landscape of microbiology. (Nearly) complete, (mainly) accurate sequence data is now inexpensive (and cheaper than analysis). • GOLD (19/2/2014): 3,011 “finished”; 9,891 “permanent draft” genomes • GOLD (10/11/2015): 7,657 “finished”; 27,438 “permanent draft” genomes; 50,673 prokaryotes • NCBI WGS (19/2/2014): 17,023 microbial genomes • NCBI Genome (10/11/2015): 55,033 prokaryotic genomes
  • 47. Pseudomonas. In 2011, 25 isolate sequences [21]; in 2015, 2098 genomes. We’re going to need bigger bioinformatics... [21] Studholme (2011) Mol. Plant Pathol. doi:10.1111/j.1364-3703.2011.00713.x
  • 48. Microbial Genomics and Bioinformatics BM405. 2. Assembly. Leighton Pritchard [1,2,3]. [1] Information and Computational Sciences, [2] Centre for Human and Animal Pathogens in the Environment, [3] Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
  • 50. What do you get from sequencing? Sequence reads. Usually lots of them. Size/number/errors depend on the technology used [22]. [22] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
  • 51. Sequence Read Data Formats. Two common formats for sequence read data: • FASTQ: Related to FASTA, a de facto standard for sequence reads • SAM/BAM: Sequence alignment/mapping format, in two flavours: uncompressed (SAM) and compressed (BAM). New formats are required to handle very large numbers of genomes: • CRAM: Reference-based sequence compression. You might also receive assembled genomes directly from a sequencing partner.
  • 53. FASTQ. Source: Cock et al. (2009) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137
    @HISEQ2500-09:168:HA424ADXX:2:1101:1404:2061 1:N:0:ATCTCTCTCACCAACT
    CGGTCTTGGGATAGATGGGTTGCAGGTTGCGGTAAAGCTCGGACTCCAGAGCGTCCAGGGTAGACTGGCTAATCTTCTGCTCTTTATCGATCATTATTTC
    +
    @@CBDDFFHHDFDHEGHIICGIFHHIIIIFHGGHIEHHIIIIGHGHIIIIIGGHHFFFFC@CBCCCDDBDCDDDDDDDDCCDDDD3@ABDDDDDEEEDE@
    Files typically have a .fq or .fastq extension. Four lines per sequence: 1. Header: sequence identifier and optional description, starts with “@” 2. Raw sequence ([ACGTN]) 3. Optional header, repeats line 1, starts with “+” 4. Quality scores, numbers encoded as ASCII. Qphred = −10 log10(e), where e is the estimated probability that a base call is incorrect (a logarithmic scale, like pH).
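The four-line record and the Phred formula can be sketched in a few lines of Python. This is a minimal illustration, assuming strict four-line records (no wrapped sequence lines) and the Sanger offset of 33; the function names are hypothetical and not part of any library.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from strict four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # Pull the remaining three lines; missing lines become empty strings.
        seq, plus, qual = (next(it, "").rstrip("\n") for _ in range(3))
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores (Sanger offset assumed)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 log10(e) to recover the estimated error probability e."""
    return 10 ** (-q / 10)
```

For real data, an established parser (for example Biopython's `SeqIO`) is a safer choice than a hand-written reader.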
  • 54. Quality Control The quality of basecalls (error rate) varies between and along reads. (real data from our E.coli sequencing: good quality)
  • 55. Quality Control Some datasets are better than others. Reads can be trimmed, or discarded. Including poor reads compromises assembly.
  • 56. FASTQ encoding. Source: Cock et al. (2009) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137. There is more than one version of FASTQ; the versions differ in their quality encoding. The ASCII characters used to encode quality scores start at different values.
  • 57. FASTQ encoding. Source: Cock et al. (2009) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137. Versions vary by sequencer and period. Most have now settled on Sanger format (historical data occasionally appears in other encodings). Quality scores are offset so that their ASCII codes lie in the given range: 1. Sanger: 33-126, used in SAM/BAM, and Illumina 1.8+ 2. Illumina 1.0-1.2: 59-126 3. Illumina 1.3 up to (but not including) 1.8: 64-126. Knowing where your data comes from, and the data format and version, is always important.
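The ASCII ranges above suggest a simple heuristic for guessing the encoding of an unlabelled file. The sketch below is an assumption-laden illustration rather than a reliable detector: any character below ASCII 59 can only come from the Sanger range, but a file of uniformly high-quality Sanger reads would be misclassified. The function name is hypothetical.

```python
from typing import Iterable


def guess_quality_offset(quality_strings: Iterable[str]) -> int:
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Heuristic only: characters below ASCII 59 occur only in Sanger-encoded
    data (offset 33); otherwise an offset of 64 is assumed.
    """
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        raise ValueError("no quality characters supplied")
    return 33 if min(codes) < 59 else 64


# A fragment of the quality line from the FASTQ slide: it contains "3"
# (ASCII 51), which is below 59, so the Sanger offset is inferred.
print(guess_quality_offset(["@@CBDDFFHHDFDHEG", "CCDDDD3@ABDDDDDE"]))
```

Checking many reads rather than a handful makes the guess more trustworthy, but the sequencing provider's documentation remains the authoritative source.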
  • 59. SAM. Specification: https://github.com/samtools/hts-specs. Intended to represent read alignments, also used for raw reads. Tab-delimited plain text. Headers (optional) start with “@”.
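Because SAM is tab-delimited text with optional “@” header lines, a basic reader is short. The sketch below assumes the eleven mandatory columns defined in the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and ignores optional tags; `parse_sam` is a hypothetical helper, not a samtools or pysam function.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sam(lines: Iterable[str]) -> Tuple[List[str], List[Dict[str, object]]]:
    """Split SAM text into header lines and a list of alignment records."""
    headers: List[str] = []
    alignments: List[Dict[str, object]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)  # header lines, e.g. @HD, @SQ, @RG, @PG
            continue
        fields = line.split("\t")
        if len(fields) < 11:
            raise ValueError("alignment line has fewer than 11 mandatory fields")
        flag = int(fields[1])
        alignments.append({
            "qname": fields[0],
            "flag": flag,
            "rname": fields[2],
            "pos": int(fields[3]),       # 1-based leftmost mapping position
            "mapq": int(fields[4]),
            "cigar": fields[5],
            "seq": fields[9],
            "qual": fields[10],
            "unmapped": bool(flag & 0x4),  # FLAG bit 0x4 marks an unmapped read
        })
    return headers, alignments
```

In practice BAM and CRAM files are read through a library such as pysam or via samtools, which handle compression and indexing.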
  • 60. BAM/CRAM. Specifications: BAM https://github.com/samtools/hts-specs ; CRAM http://www.ebi.ac.uk/ena/software/cram-toolkit. BAM is a compressed version of SAM. • BGZF compression. • Random access within compressed file, through indexing. CRAM format may come to dominate, especially in archives, as datasets get larger: • Reference-based compression [23]. • Highly suited to compression and archiving of very large amounts of sequence data [24]. [23] Fritz et al. (2011) Genome Res. 21:734-740 doi:10.1101/gr.114819.110 [24] Cochrane et al. (2012) GigaScience 1:2 doi:10.1186/2047-217X-1-2
  • 62. Read repositories. Repositories are centrally-maintained locations that keep sequence read data from multiple projects. Submission to a repository is a requirement for publication. And the right thing to do! • ENA: The European Nucleotide Archive (http://www.ebi.ac.uk/ena), maintained by EMBL-EBI • SRA: The Sequence Read Archive, formerly the Short Read Archive (http://www.ncbi.nlm.nih.gov/sra), maintained in the US by NCBI
  • 63. Sequence Assembly Once you have reads, you can assemble a genome. Two main approaches to read assembly: • Overlap-Layout-Consensus: Typically used with smaller sets of longer reads (e.g. 454, PacBio, Ion, Nanopore) • de Bruijn assembly: Typically used with many, shorter reads (e.g. Illumina), but also useful for longer reads See e.g. Leland Taylor’s thesis (http://gcat.davidson.edu/phast/docs/Thesis PHAST LelandTaylor.pdf), and PHAST (http://gcat.davidson.edu/phast/index.html).
  • 66. Overlap-Layout-Consensus. The oldest approach, originally used with smaller sets of longer reads. Can be time consuming (all-vs-all comparisons), but this is offset by graph-based OLC algorithms (e.g. SGA). Now more important again, with long-read data. • Celera Assembler [25] • Newbler (the Roche/454 GS assembler) [26] • String Graph Assembler [27]. [25] http://wgs-assembler.sourceforge.net/ [26] http://www.454.com/products/analysis-software/ [27] Simpson and Durbin (2012) Genome Res. 22:549-556 doi:10.1101/gr.126953.111
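The all-vs-all comparison that makes the overlap step expensive can be illustrated with a naive suffix-prefix search. This is a teaching sketch under strong assumptions (exact matches only, no reverse complements, no sequencing errors) and is not how Celera, Newbler or SGA compute overlaps; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def overlap_length(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no overlap of at least `min_length` bases exists.
    """
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_overlaps(reads: List[str], min_length: int = 3) -> Dict[Tuple[int, int], int]:
    """Compare every ordered pair of reads: quadratic in the number of reads."""
    overlaps: Dict[Tuple[int, int], int] = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            n = overlap_length(a, b, min_length)
            if n:
                overlaps[(i, j)] = n
    return overlaps
```

The resulting dictionary is the edge list of an overlap graph; the layout and consensus steps then order the reads along a path and call each base from the stacked reads.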
  • 68. de Bruijn graph assembly Used for short reads (e.g. Illumina): k-mer based graph (choice of k important):
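A de Bruijn graph can be built directly from the k-mers in the reads. The sketch below uses one common convention, assumed here for illustration: nodes are (k-1)-mers and every k-mer contributes a directed edge from its prefix to its suffix. It ignores reverse complements and error correction, which real assemblers must handle.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


# With k=3, the read ACGT contains the k-mers ACG and CGT.
print(de_bruijn_graph(["ACGT"], 3))
```

Small k gives a well-connected but tangled graph; large k resolves more repeats but needs higher coverage, which is why the choice of k matters.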
  • 69. de Bruijn graph assembly. k-mer based genome and read graphs [28]. “True” edges = genome; “Error” edges = wrong assembly. [28] Chaisson et al. (2009) Genome Res. 19:336-346 doi:10.1101/gr.079053.108
  • 70. de Bruijn graph assembly. All sequencing technologies have basecall errors. • The proportion of errors is approximately constant per read • Basecall errors lead to edge errors • The more reads you have, the more errors there are. Increased coverage does not ensure increased accuracy [29]. [29] Conway and Bromage (2011) Bioinformatics 27:479-486 doi:10.1093/bioinformatics/btq697
  • 71. de Bruijn graph assembly. Fast, and scales well to large datasets, as it never computes all-against-all overlaps. Sensitive to sequencing errors, but resolves short repeats (graph bulges and whirls). Notable tools: • Velvet [30] • CLC Assembly Cell [31] • Cortex [32]. [30] Zerbino and Birney (2008) Genome Res. 18:821-829 doi:10.1101/gr.074492.107 [31] http://www.clcbio.com/products/clc-assembly-cell/ [32] Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028
  • 72. “Coloured” de Bruijn graph assemblies. Cortex [33] allows for on-the-fly identification of complex variation, and genotyping, by tracking “coloured” edges in the graph. Colours ≈ different isolates/organisms (e.g. a reference). [33] Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028
  • 74. Why map reads? Source: Trapnell et al. (2009) Nat. Biotech. 27:455-457 doi:10.1038/nbt0509-455. “Resequencing” an organism (sequencing a close relative, looking for SNPs/indels). RNA-seq, ChIP-seq, etc.: coverage ≈ expression/binding. To see where reads map on an assembled genome: • Is coverage even? (can indicate repeats) • Are there SNPs/indels? (heterogeneous population) • Assembly problems?
  • 75. Short-Read Sequence Alignment. Source: Trapnell et al. (2009) Nat. Biotech. 27:455-457 doi:10.1038/nbt0509-455. An embarrassment of tools (over 60 listed on Wikipedia). Main approaches: • Alignment: Smith-Waterman is mathematically guaranteed to find the optimal local alignment (e.g. BFAST, MOSAIK); approximation to S-W (e.g. BLAST); ungapped or gapped alignment (e.g. MAQ, FAST, mrFAST, SOAP). Can be slow. • Burrows-Wheeler Transform: Makes reusable index of the genome (e.g. Bowtie, BWA), can be extended to consider sequence probability (e.g. BWA-PSSM). Can be very fast. Other tools may employ different algorithms, some designed to be parallelised on GPUs/FPGAs (e.g. NextGenMap, XpressAlign)
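The Smith-Waterman guarantee comes from dynamic programming over every pair of positions, which is also why it is slow. A minimal sketch of the scoring recurrence follows, assuming a linear gap penalty and illustrative scoring values (match +2, mismatch -1, gap -2) that are not taken from any of the tools named above.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `a` and `b` (linear gap penalty).

    Runs in O(len(a) * len(b)) time, keeping only two rows of the matrix.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

Mapping millions of reads against a genome with this recurrence alone is impractical, which motivates indexed approaches such as the Burrows-Wheeler Transform.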
  • 76. Visualising Read Mapping. Several tools available, e.g. Tablet (the best...) [34]. [34] Milne et al. (2013) Brief. Bioinf. 14:193-202 doi:10.1093/bib/bbs012
  • 78. In an ideal world. Ideally, you would have one sequence per chromosome/plasmid (and no errors): a closed/complete genome. PacBio, Sanger, manual closing, Nanopore(?)
  • 79. More realistically. . . Typically, a number of assembled fragments (contigs or scaffolds) are returned in FASTA format: a draft, disordered genome. Around 250 contigs for a 5Mbp genome is usual with Illumina
  • 80. Ordering contigs Contigs can be ordered correctly into scaffolds if paired-end reads span gaps, or long reads are available (typically done during assembly). Gaps are usually filled with Ns (length estimated)
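Joining ordered contigs into a scaffold with N-filled gaps is straightforward once the order and the gap estimates are known. The sketch below assumes the contigs are already ordered and oriented; the `unknown_gap` placeholder length of 100 is an arbitrary illustrative default, and `build_scaffold` is a hypothetical helper.

```python
from typing import List, Optional


def build_scaffold(contigs: List[str], gap_lengths: List[Optional[int]],
                   unknown_gap: int = 100) -> str:
    """Join ordered contigs, filling each gap with a run of Ns.

    gap_lengths[i] is the estimated gap between contigs[i] and contigs[i + 1];
    None means the gap size is unknown and `unknown_gap` Ns are used instead.
    """
    if not contigs:
        return ""
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_lengths):
        length = unknown_gap if gap is None else gap
        if length < 0:
            raise ValueError("gap lengths must not be negative")
        parts.append("N" * length)
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGCC"], [3]))
```

Runs of Ns in a downloaded genome are therefore a sign of scaffolding rather than of ambiguous base calls at each position.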
  • 81. Ordering contigs. Contigs and scaffolds can also be reordered by alignment to a reference genome. • Mauve/progressiveMauve [35] • MUMmer [36]. [35] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704 [36] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
  • 82. Where next? Source: Lefebure et al. (2010) Genome Biol. Evol. 2:646-655 doi:10.1093/gbe/evq048
  • 83. Microbial Genomics and Bioinformatics BM405. 3. Whole Genome Comparisons. Leighton Pritchard [1,2,3]. [1] Information and Computational Sciences, [2] Centre for Human and Animal Pathogens in the Environment, [3] Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
  • 86. The Power of Comparative Genomics Massively enabled by high-throughput sequencing, and the availability of thousands of sequenced isolates. Computational comparisons more powerful and precise than experimental comparative genomics: the ultimate microbial typing solution Three broad areas/scales: • Comparison of bulk genome properties • Whole genome sequence comparisons • Comparison of features/functional components
  • 88. Nucleotide frequency/genome size • Very easy to calculate from complete/draft genome • Can calculate for individual contigs/scaffolds/regions • Usually reported in GUI genome browsers Trivial to determine using, e.g. Python
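As an illustration of how little code this takes, the sketch below computes total size and GC content from FASTA-formatted text using only the Python standard library. The function names are hypothetical, and GC content is calculated over unambiguous bases only (an assumption; some tools include N in the denominator).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counts = Counter(sequence.upper())
    acgt = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / acgt if acgt else 0.0


def genome_summary(records: Dict[str, str]) -> Tuple[int, float]:
    """Total length and overall GC content across all contigs/scaffolds."""
    combined = "".join(records.values())
    return len(combined), gc_content(combined)
```

Calling `gc_content` on each record separately gives the per-contig values that are useful for spotting contamination or plasmids.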
  • 89. Nucleotide frequency/genome size GC content and chromosome size can be characteristic See data/bacteria size for example iPython notebook exercise
  • 90. Blobology. Source: Kumar and Blaxter (2011) Symbiosis 3:119-126 doi:10.1007/s13199-012-0154-6. Sequencing samples may be contaminated or contain microbial symbionts. Expect more host than symbiont/contaminant DNA. GC content and read coverage can be used to separate contigs, following assembly and mapping. http://nematodes.org/bioinformatics/blobology/
  • 91. k-mers • Nucleotides: [ACGT] • Dinucleotides: [AA|AC|AG|AT|CA|CC|...] (16 dimers) • Trinucleotides: [AAA|AAC|AAG|AAT|ACA|...] (64 trimers) • k-mers: 4^k k-mers (see example in data/shiny)
  • 92. k-mers GC content = point value; k-mer frequencies = vector (list) Diagnostic differences in k-mer frequency, and variability. The basis of several comparison tools E.coli Mycoplasma spp.
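The k-mer frequency vector can be computed by sliding a window of width k along the sequence. A minimal sketch, assuming a single strand and ignoring any window that contains an ambiguous base; the function name is illustrative.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int) -> Dict[str, float]:
    """Relative frequency of each of the 4**k possible k-mers in `sequence`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k k-mers so every genome yields a vector of equal length.
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}
```

Because every genome maps to a vector of the same length, two genomes can be compared with any standard distance measure between their vectors.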
  • 94. What to align, and why? To be useful, aligned genomes should: • derive from a sufficiently recent common ancestor, so homologous regions can be identified • derive from a sufficiently distant common ancestor, so that there are “interesting” differences to be identified • help to answer your biological question
  • 95. How to align, and why? Naive sequence aligners (Needleman-Wunsch, Smith-Waterman) are not appropriate for genome alignment • Computationally expensive on large sequences • Cannot handle rearrangements. Very many alternative alignment algorithms proposed • megaBLAST http://www.ncbi.nlm.nih.gov/blast/html/megablast.html • MUMmer http://mummer.sourceforge.net/ • BLAT http://genome.ucsc.edu/goldenPath/help/blatSpec.html • LASTZ http://www.bx.psu.edu/~rsharris/lastz/ • LAGAN http://lagan.stanford.edu/lagan web/index.shtml • and many, many more... Example exercises in data/whole_genome_alignment.
  • 96. megaBLAST. Optimised for speed, over BLASTN [37] • Genome-level searches • Queries on large sequence sets • Long alignments of very similar sequence. Uses the greedy algorithm by Zhang et al. [38] rather than the standard BLAST algorithm. • Concatenates queries (“query packing”) to improve performance • Two modes: megaBLAST and discontinuous (dc-megablast) for divergent sequences. BLASTN now uses the megaBLAST algorithm by default. [37] http://www.ncbi.nlm.nih.gov/blast/Why.shtml [38] Zhang et al. (2000) J. Comp. Biol. 7:203-214 doi:10.1089/10665270050081478
  • 97. BLAST vs megaBLAST megaBLAST is faster, but does it give the same biological results? megaBLAST (top) and BLAST (bottom) pairwise comparisons:
  • 98. BLAST vs megaBLAST Filter out weak matches - not quite identical:
  • 99. MUMmer. Source: Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12. Uses suffix trees for pattern matching: very fast even for large sequences • Finds maximal exact matches • Memory use depends only on the reference sequence size. Suffix trees: (http://en.wikipedia.org/wiki/Suffix tree) • Can be built and searched in O(n) time • But useful algorithms are nontrivial
  • 100. The MUMmer algorithm. Source: Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12. 1. Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs) 2. Cluster into alignment anchors 3. Extend between anchors to produce the final alignment. This is the basis of a very flexible suite of programs that align different kinds of sequence: mummer, nucmer, promer • nucleotide and (more sensitive) “conceptual protein” alignments • used for genome comparisons, assembly scaffolding, repeat detection, ... • the basis of other aligners/assemblers (e.g. Mugsy, AMOS)
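Step 1 can be illustrated with a deliberately naive search for maximal exact matches. This sketch is not MUMmer's method: it compares every pair of positions in quadratic time instead of using a suffix tree, it does not check uniqueness (so it reports MEMs rather than MUMs), and it ignores the reverse strand. The function name and the `min_length` default are hypothetical.

```python
from typing import List, Tuple


def maximal_exact_matches(reference: str, query: str,
                          min_length: int = 4) -> List[Tuple[int, int, int]]:
    """Return (ref_start, query_start, length) for each maximal exact match.

    A match is maximal if it cannot be extended to the left or to the right
    without introducing a mismatch. Positions are 0-based.
    """
    matches: List[Tuple[int, int, int]] = []
    for i in range(len(reference)):
        for j in range(len(query)):
            if reference[i] != query[j]:
                continue
            # Skip starts that could be extended leftwards: not left-maximal.
            if i > 0 and j > 0 and reference[i - 1] == query[j - 1]:
                continue
            length = 0
            while (i + length < len(reference) and j + length < len(query)
                   and reference[i + length] == query[j + length]):
                length += 1  # extend rightwards until the first mismatch
            if length >= min_length:
                matches.append((i, j, length))
    return matches


print(maximal_exact_matches("TTTACGTACGGG", "CCACGTACAA"))
```

The anchors found this way would then be clustered and the regions between them aligned, as in steps 2 and 3 above.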
  • 101. MUMmer vs megaBLAST MUMmer identifies fewer weak matches megaBLAST (top) and MUMmer (bottom) pairwise comparisons:
  • 102. MUMmer vs megaBLAST Filter out weak BLAST matches - not quite identical:
  • 104. DNA-DNA hybridisation (DDH) [Rossello-Mora and Amann (2001) FEMS Microbiol. Rev. 25:39-67 doi:10.1016/S0168-6445(00)00040-1] • “Gold Standard” for prokaryotic taxonomy since the 1960s: “70% identity ≈ same species.” • Denature DNA from two organisms. • Allow to anneal. Reassociation ≈ similarity, measured as ∆T of denaturation curves. DDH is a proxy for sequence similarity: replace it with genome analysis [39]? [39] Chan et al. (2012) BMC Microbiol. 12:302 doi:10.1186/1471-2180-12-302
  • 105. Average Nucleotide Identity (ANIb) [Goris et al. (2007) Int. J. Syst. Evol. Microbiol. 57:81-91 doi:10.1099/ijs.0.64483-0] 1. Break genomes into 1020 nt fragments 2. ANIb: mean % identity of all BLASTN matches with > 30% identity and > 70% fragment coverage. • DDH:ANIb linear • DDH:%ID linear • 70% DDH ≈ 95% ANIb
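The two ANIb steps can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the BLASTN search itself is not run here, and the hits are assumed to arrive as one `(percent_identity, percent_fragment_coverage)` pair per fragment. The function names, and the choice to keep a shorter final fragment, are hypothetical.

```python
def fragment(genome, size=1020):
    """Split a genome string into consecutive fragments of `size` nt.
    Assumption: the final, shorter fragment is kept rather than discarded."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def ani_from_hits(hits, min_identity=30.0, min_coverage=70.0):
    """Mean % identity over hits passing both thresholds.

    hits: iterable of (percent_identity, percent_fragment_coverage),
    assumed to be the best BLASTN hit for each query fragment.
    Returns None if no hit passes the filters.
    """
    kept = [ident for ident, cov in hits
            if ident > min_identity and cov > min_coverage]
    return sum(kept) / len(kept) if kept else None


if __name__ == "__main__":
    print([len(f) for f in fragment("A" * 2500)])
    hits = [(98.0, 95.0), (96.0, 80.0), (99.0, 50.0), (25.0, 90.0)]
    print(ani_from_hits(hits))  # last two hits fail coverage / identity filters
```

The filtering step is where ANIb discards information: fragments without a qualifying match contribute nothing to the mean.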
  • 106. Average Nucleotide Identity (ANIm) [Richter and Rossello-Mora (2009) Proc. Natl. Acad. Sci. USA 106:19126-19131 doi:10.1073/pnas.0906412106] 1. Align genomes (MUMmer) 2. ANIm: mean % identity of all matches • DDH:ANIm linear • 70% DDH ≈ 95% ANIm TETRA: tetranucleotide frequency-based classifier introduced in the same paper.
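A tetranucleotide comparison in the spirit of TETRA can be sketched in a few lines. This is a simplification: as far as I understand it, the published method correlates Z-scores of tetranucleotide usage, whereas the sketch below correlates raw frequencies. The names `tetra_frequencies` and `pearson` are illustrative, and only the forward strand is counted.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq):
    """Relative frequency of each overlapping 4-mer (ambiguous bases skipped)."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total for k in KMERS] if total else [0.0] * len(KMERS)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.
    Assumes neither vector is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


if __name__ == "__main__":
    f1 = tetra_frequencies("ACGTACGTTTGACCA")
    f2 = tetra_frequencies("ACGTACGATTGACCA")
    print(pearson(f1, f2))
```

No alignment is involved, which is why this kind of measure reflects bulk composition rather than shared evolutionary history.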
  • 107. ANI/TETRA comparison All three methods applied to Anaplasma spp. [Figure: three clustered heatmaps (ANIb, ANIm, TETRA) of nine Anaplasma genomes: A. phagocytophilum NC_021881, NC_021880, NC_021879 and NC_007797; A. centrale NC_013532; A. marginale NC_004842, NC_012026, NC_022760 and NC_022784. Colour key: 0.9 to 0.98.] ANIb discards information, relative to ANIm: less sensitive. ANIb/ANIm ≈ evolutionary history; TETRA ≈ bulk composition.
  • 108. ANI in practice Practical applications [40] (note: no gene content used) • 34 Dickeya isolates: species structure • 180 E.coli isolates: subtyping [Figure: clustered ANIm heatmap of the isolate assemblies (labelled *_contigs); colour key 0.9 to 0.98; group labels A, B1, B2, C, D, E, F, U, X.] [40] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
  • 110. Collinearity and Synteny Genome rearrangements occur, but there can still be conservation of sequence similarity and ordering. • Two elements are collinear if they lie in the same linear sequence • Two elements are syntenous (or syntenic) if: • (orig.) they lie on the same chromosome • (mod.) there is conservation of blocks of order within the same chromosome Signs of evolutionary constraints, like sequence conservation or synteny, may indicate functional genome regions.
  • 111. Pyrococcus spp. [Zivanovic et al. (2002) Nuc. Acids Res. 30:1902-1910 doi:10.1093/nar/30.9.1902] Comparison of Pyrococcus genomes (P. horikoshii, P. abyssi, P. furiosus) shows chromosome shuffling. Transposition is a major cause of genomic disruption.
  • 112. Vibrio mimicus [Hasan et al. (2010) Proc. Natl. Acad. Sci. USA 107:21134-21139 doi:10.1073/pnas.1013825107] Chromosome C-II carries genes associated with environmental adaptation; C-I carries virulence genes. C-II has undergone extensive rearrangement; C-I has not. Suggests modularity of genome organisation, as a mechanism for adaptation (HGT, two-speed genome).
  • 113. Serratia symbiotica [Burke and Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002] S. symbiotica is a recently evolved symbiont of aphids. Massive genomic decay is an adaptation to the new environment.
  • 115. Multiple genome alignment is hard Can we not just align all our genomes together? No, because it’s really, really hard. Analogous to problems with multiple sequence alignment (three or more sequences). • Computationally extremely expensive: O(L^n), where L = length of sequence and n = number of sequences • NP-complete problem: no known efficient way to find an optimal solution Heuristic (approximate) methods are used, most commonly: • Progressive alignment • Iterative alignment
  • 116. Mauve [Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704] Progressive alignment tool, with a GUI. Application to nine enterobacteria: rearrangement of homologous backbone. Alternatives include MLAGAN [41] and MUMmer [42] [41] Brudno et al. (2003) Genome Res. 13:721-731 doi:10.1101/gr.926603 [42] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
  • 117. Mauve algorithm [Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704] 1. Find local alignments (multi-MUMs) 2. Build guide tree from multi-MUMs 3. Select subset of multi-MUMs as anchors, and partition into Local Collinear Blocks (LCBs): consistently ordered subsets 4. Progressive alignment against guide tree
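The idea of partitioning anchors into "consistently ordered subsets" (step 3) can be shown with a much-simplified two-genome sketch. It is not Mauve's procedure: real LCB detection also handles inversions, weights blocks, and discards weak ones. Here anchors are assumed to be unique, same-strand matches given as `(position_in_A, position_in_B)` pairs, and the function name is hypothetical. A block is a run of anchors that are neighbours, in the same order, in both genomes.

```python
def collinear_blocks(anchors):
    """Group anchors into maximal runs that are adjacent and identically
    ordered in both genomes.

    anchors: list of distinct (pos_in_a, pos_in_b) tuples, same strand.
    Returns a list of blocks, each a list of anchors sorted by pos_in_a.
    """
    by_a = sorted(anchors)
    # Rank of every anchor when the anchors are ordered along genome B.
    rank_b = {anchor: rank for rank, anchor in
              enumerate(sorted(by_a, key=lambda p: p[1]))}
    blocks, current = [], []
    for anchor in by_a:
        # Start a new block when this anchor is not the next one along B.
        if current and rank_b[anchor] != rank_b[current[-1]] + 1:
            blocks.append(current)
            current = []
        current.append(anchor)
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    anchors = [(10, 110), (20, 120), (30, 300), (40, 310), (50, 200)]
    for block in collinear_blocks(anchors):
        print(block)
```

In the example, the third and fourth anchors have been moved past the fifth in genome B, so the five anchors fall into three blocks.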
  • 118. Reordering contigs [Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704] Mauve also enables draft genome reordering. Once LCBs are identified, the Mauve Contig Mover can be applied to reorder contigs. Example exercise in data/whole_genome_alignment
  • 120. Chromosome painting [Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055] “Chromosome painting” infers recombination-derived ‘chunks’. A genome’s haplotype is reconstructed in terms of recombination events from a ‘donor’ to a ‘recipient’ genome.
  • 121. Chromosome painting [Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055] Recombination events summarised in a coancestry matrix. H. pylori: most within geographical bounds, but asymmetrical donation from Amerind/East Asian to European isolates.
  • 123. P.aeruginosa nosocomial acquisition [Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278] Motivation: • Nosocomial water transmission of P.aeruginosa is an urgent concern Setup: • Burns patients (30) screened for P.aeruginosa on admission • Samples taken from patients and environment • All P.aeruginosa isolates (141) whole-genome sequenced (WGS) Outcome: • Clustering of isolates by room and outlet • Three patient isolates identical to water isolates from same room • Biofilm from thermostatic mixer valve a possible source
  • 124. P.aeruginosa nosocomial acquisition [Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278]
  • 125. P.aeruginosa nosocomial acquisition [Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278] Methods • Illumina MiSeq WGS of 141 isolates • Metagenomic sequencing of biofilm • Simulated sequencing of 55 published P. aeruginosa genomes • BWA mapping against PAO1 reference genome • SNPs called with SAMtools & VarScan • ML reconstruction with FastTree • De novo assembly with Velvet for MLST prediction Sequences and bioinformatic methods shared online: http://www.github.com/joshquick/snp_calling_scripts
  • 126. P.aeruginosa nosocomial acquisition [Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278]
  • 127. P.aeruginosa nosocomial acquisition [Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278]
  • 128. P.aeruginosa nosocomial acquisition [Quick et al. (2014) BMJ Open 4:e006278 doi:10.1136/bmjopen-2014-006278] Strengths: • A P. aeruginosa source could be tracked by WGS • Insights into transmission: water to patient a likely route • Sensitivity: identifies microevolution Limitations: • Small sample size: 5/30 patients infected, gave 55/141 isolates • Not clear that causal inferences are general • 300-day sampling, not real-time crisis analysis • Good existing reference genome set for this bacterium • Sequencing cost: ≈£8k; Staff cost: ≈£15k
  • 129. Microbial Genomics and Bioinformatics BM405 4.Genome Features Leighton Pritchard1,2,3 1 Information and Computational Sciences, 2 Centre for Human and Animal Pathogens in the Environment, 3 Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
  • 132. Genome Features • Genome features are annotated regions of the genome. • Typically represent functional elements. • May be simple (single region), or complex (subfeatures)
  • 133. Why annotate genome features? • Almost all use of genomics depends on annotation: annotation quality is critical to downstream use of genomics in biology • Annotation is curation (a live, active process), not cataloguing • Automated annotation from curated data (public databases) is the only game in town, given the data quantities we generate • But you can’t propagate something that doesn’t exist: up to 30% of metabolic activity has no known gene associated with it [43] • Biocurators can spend as much time “de-annotating” literature-based annotations as entering new data [44] [43] Chen and Vitkup (2007) Trends Biotech. doi:10.1016/j.tibtech.2007.06.001 [44] Bairoch (2009) Nat. Preced. doi:10.1038/npre.2009.3092.1
  • 134. Gene Features Gene features have significant substructure, especially in eukaryotes. • 5′ UTR • translation start • intron start/stop • exon start/stop • translation stop • transcription terminator • 3′ UTR
  • 135. ncRNA Features • tRNA - transfer RNA • rRNA - ribosomal RNA • CRISPRs - bacterial/archaeal defence (used for genome editing) • many other classes
  • 136. Regulatory/Repeat Features Regulatory sites • transcription start sites • RNA polymerase binding sites • Transcription Factor Binding Sites (TFBS) Repetitive regions and mobile elements • tandem repeats • (retro-)transposable elements • phage inclusions
  • 137. Principles of feature prediction Two main approaches to feature prediction: • ab initio prediction - start from first principles, using only the genome sequence: • Unsupervised methods - not trained on a dataset • Supervised methods - trained on a dataset • homology matches • alignment to features from related organisms (comparative genomics, annotation transfer) • from known gene products (e.g. proteins, ncRNA) • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq) Dedicated tools available for many different classes of feature.
  • 139. Prokaryotic CDS Prediction Methods Using CDS prediction as an illustrative example for all feature prediction. Sequence conservation (evolutionary constraint; an unsupervised, a priori method) can be useful • Prokaryotes “easier” than eukaryotes for gene/CDS prediction • Less uncertainty in predictions (isoforms, gene structure) • Very gene-dense (over 90% of chromosome is coding sequence) • No intron-exon structure
  • 140. Prokaryotic CDS Prediction Methods ORFs are plentiful: • Problem is: “which possible ORF contains the true gene, and which start site is correct?” • Still not a solved problem
  • 141. Finding Open Reading Frames The simplest approach: find ORFs (sequence between two consecutive in-frame stop codons) • ORF finding is naive; it does not consider: • Start codon • Promoter/RBS motifs • Wider context (e.g. overlapping genes) Dedicated tools, e.g. Glimmer, Prodigal, RAST, GeneMarkS, are usually better.
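The naive ORF definition above translates almost directly into code. The sketch below is illustrative only: the function names and the `min_codons` parameter are assumptions, the standard stop codons TAA/TAG/TGA are used, and the first stretch in each frame is bounded by the sequence start rather than by a stop codon. As the slide says, nothing here looks at start codons, RBS motifs or overlapping genes.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) of stop-free stretches in one reading frame.

    `end` is the index of the terminating stop codon, so seq[start:end]
    contains no in-frame stop. A trailing stretch with no closing stop
    codon is not reported.
    """
    start = frame
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOPS:
            if (pos - start) // 3 >= min_codons:
                yield (start, pos)
            start = pos + 3


def find_orfs(seq, min_codons=1):
    """ORFs in all six frames as (strand, frame, start, end) tuples.
    Coordinates on the '-' strand refer to the reverse-complemented sequence."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found


if __name__ == "__main__":
    for orf in find_orfs("ATGAAATAGCCCTGA"):
        print(orf)
```

Even on a short toy sequence this reports several candidates, which is the problem stated on the previous slide: ORFs are plentiful, and most are not genes.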
  • 142. Two ab initio CDS Prediction Tools • Glimmer [45] • Interpolated Markov models • Can be trained on “gold standard” datasets • Prodigal [46] • Log-likelihood model based on GC frame plots, followed by dynamic programming • Can be trained on “gold standard” datasets Applying these to an example bacterial chromosome. . . [45] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009 [46] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119
  • 143. Comparing predictions in Artemis [Carver et al. (2012) Bioinformatics 28:464-469 doi:10.1093/bioinformatics/btr703] Not every ORF (green) is predicted to be a coding sequence (CDS; blue/orange). Self-contradictory CDS calls (orange); even automated annotation needs manual curation.
  • 144. Comparing predictions in Artemis Glimmer(green)/Prodigal(blue) CDS prediction methods do not always agree (presence/absence, start position). How do we know which (if either) is best?
  • 146. Using a “Gold Standard”: validation [Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4] A general approach for all predictive methods • Define a known, “correct” set of true/false, positive/negative etc. examples: the “gold standard” • Evaluate your predictive method against that set for sensitivity, specificity, accuracy, precision, etc. This ought to be done by the method developers, but it is often wise to evaluate in your own system. Many methods are available; full coverage is beyond the scope of this introduction.
  • 147. Contingency Tables
                                Condition (Gold standard)
                                True              False
      Test outcome   Positive   True Positive     False Positive
                     Negative   False Negative    True Negative
    Performance Metrics
      Sensitivity = TPR = TP/(TP + FN)
      Specificity = TNR = TN/(FP + TN)
      FPR = 1 − Specificity = FP/(FP + TN)
    If you don’t have this information, you can’t interpret predictive results properly.
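The metrics above are simple ratios of the four contingency table counts. The sketch below computes them, together with PPV (precision) and FDR in their standard definitions, PPV = TP/(TP + FP) and FDR = FP/(TP + FP) = 1 − PPV, since both appear in the results tables that follow. The function name and the example counts are made up for illustration.

```python
def performance_metrics(tp, fp, fn, tn):
    """Standard performance metrics from contingency table counts.

    tp, fp, fn, tn: counts of true positives, false positives,
    false negatives and true negatives.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (fp + tn),  # true negative rate
        "fpr": fp / (fp + tn),          # 1 - specificity
        "ppv": tp / (tp + fp),          # precision
        "fdr": fp / (tp + fp),          # 1 - PPV
    }


if __name__ == "__main__":
    # Hypothetical counts, not data from the slides.
    for name, value in performance_metrics(tp=90, fp=10, fn=10, tn=890).items():
        print(f"{name}: {value:.3f}")
```

Note that sensitivity and PPV need no true negative count, which is convenient for gene calling, where "true negatives" (correctly uncalled ORFs) are hard to define.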
  • 148. “Gold Standard” results • Tested glimmer [47] and prodigal [48] on two enterobacterial close relatives as “gold standards” (still not perfect. . .) 1. Manually annotated (>3 expert person years) 2. Community-annotated (many research groups, interested in their own subset of genes) • Both methods trained directly on the annotated genes in each organism! [47] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009 [48] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119
  • 149. “Gold Standard” results Manually annotated: 4550 CDS
      genecaller                      glimmer     prodigal
      predicted                       4752        4287
      missed                          284 (6%)    407 (9%)
      Exact Prediction  sensitivity   62%         71%
                        FDR           41%         25%
                        PPV           59%         75%
      Correct ORF       sensitivity   94%         91%
                        FDR           10%         3%
                        PPV           90%         97%
  • 150. “Gold Standard” results Community annotated: 4475 CDS
      genecaller                      glimmer     prodigal
      predicted                       4679        4467
      missed                          112 (3%)    156 (3%)
      Exact Prediction  sensitivity   62%         86%
                        FDR           31%         14%
                        PPV           69%         86%
      Correct ORF       sensitivity   97%         97%
                        FDR           7%          3%
                        PPV           93%         97%
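A comparison of this kind can be sketched as set operations on CDS coordinates. The slides do not define the two row groups, so the interpretation here is an assumption: "exact prediction" is taken to mean identical start, stop and strand, and "correct ORF" to mean the same stop and strand with any start. The function name and the toy coordinates are hypothetical, and the sketch assumes each gold-standard stop is called at most once.

```python
def compare_calls(gold, predicted):
    """Compare predicted CDS calls with a gold standard.

    gold, predicted: sets of (start, stop, strand) tuples.
    Returns sensitivity and PPV for exact matches and for calls that
    share stop and strand with a gold-standard CDS (same ORF).
    """
    exact = gold & predicted
    gold_ends = {(stop, strand) for _, stop, strand in gold}
    same_orf = {p for p in predicted if (p[1], p[2]) in gold_ends}
    return {
        "exact_sensitivity": len(exact) / len(gold),
        "exact_ppv": len(exact) / len(predicted),
        "orf_sensitivity": len(same_orf) / len(gold_ends),
        "orf_ppv": len(same_orf) / len(predicted),
    }


if __name__ == "__main__":
    # Toy coordinates, not data from the slides.
    gold = {(1, 300, "+"), (400, 900, "+"), (1000, 1500, "-")}
    predicted = {(1, 300, "+"), (430, 900, "+"), (2000, 2300, "+")}
    print(compare_calls(gold, predicted))
```

In the toy data the second prediction finds the right ORF but the wrong start, so it counts towards the "correct ORF" figures and not towards the "exact" ones, which is the pattern of difference between the two row groups in the tables.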
  • 151. Gene/CDS Prediction • Alternative CDS (and all other) prediction methods are unlikely to give identical results, or perform equally well • There is No Free Lunch (this is a theorem: http://en.wikipedia.org/wiki/No_free_lunch_theorem) • To assess/choose between methods, performance metrics are required • Even on prokaryotes (a relatively simple case), current best methods for CDS prediction are imperfect • Manual correction is often required (usually the most demanding and time-consuming part of the process).
  • 153. Prokaryotic Annotation Pipelines [Richardson and Watson (2012) Brief. Bioinf. 14:1-12 doi:10.1093/bib/bbs007] Many choices, including RAST [49], PROKKA [50], BaSYS [51], etc. Pipelines often perform both CDS/feature calling and functional prediction. Two broad approaches: 1. Heavyweight: maintain database and resource, often annotating by homology, e.g. RAST 2. Lightweight: chain together multiple third-party packages, e.g. PROKKA Pipelines take a lot of tedium (and control) out of annotating bacterial genomes, but have the same issues as every other prediction tool. [49] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75 [50] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153 [51] Van Domselaar et al. (2005) Nuc. Acids Res. 33:W455-W459 doi:10.1093/nar/gki593
  • 154. PROKKA [Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153] • Lightweight, and fast. • Runs locally (a 5 Mbp genome takes ≈10 min on my desktop; more detailed ncRNA prediction takes ≈20 min). • Flexible: built-in databases can be replaced by user databases. • Uses freely-accessible third-party tools for prediction. Simple to run (at the command-line, or in Galaxy [52]). [52] Goecks et al. (2010) Genome Biol. 11:R86 doi:10.1186/gb-2010-11-8-r86
  • 155. RAST [Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75] • Server-based (http://rast.nmpdr.org/). Queues likely. • Relies on SEED and FIGFam databases, held at NMPDR • FIGFam: isofunctional homologue families • Produces metabolic reconstruction
  • 157. Principles of function prediction At genome scale, we realistically have to automate function prediction. Function prediction is just like any other prediction method: “Does this sequence imply that function?” Two main approaches to function prediction: • ab initio prediction (on basis of feature sequence/context only), using either unsupervised methods (not trained on an exemplar dataset) or supervised methods (trained on an exemplar dataset) • homology matches (sequence similarity): alignment to features with known/predicted functions
  • 158. Homology-based function prediction Two proteins with similar sequence may have similar function. But. . . • How similar do they have to be (and where) to share the same function? • What do we mean by ‘same function’, anyway? Interaction/substrate specificity? Participation in a pathway? Contribution to a structure? Biochemical interconversion? . . . • How confident can we be in the comparator (annotated) sequence: was that function determined experimentally?
  • 159. Gene Ontology (GO) [Ashburner et al. (2000) Nat. Genet. 25:25-29 doi:10.1038/75556] The Gene Ontology provides a common vocabulary for describing biological function, and unifying functional descriptions. Ontologies (controlled vocabularies) are central to information-sharing. Gene Ontology Consortium: http://geneontology.org/ Many annotation tools and databases produce GO output, or compatible controlled vocabulary terms, e.g. • Blast2GO [53]: BLAST-based annotation • PHI-Base [54]: microbial pathogen-host interaction specific functions • GOPred [55]: combines several protein function classifiers [53] Conesa et al. (2005) Bioinformatics 21:3674-3676 doi:10.1093/bioinformatics/bti610 [54] Winnenburg et al. (2006) Nuc. Acids Res. 34:D459-D464 doi:10.1093/nar/gkj047 [55] Sarac et al. (2010) PLoS One 5:e12382 doi:10.1371/journal.pone.0012382
  • 162. Are database annotations reliable? [Schnoes et al. (2013) PLoS Comp. Biol. 9:e1003063 doi:10.1371/journal.pcbi.1003063] Are protein function annotations in databases determined experimentally, or by annotation transfer? High-throughput experiments and genome annotations are conducted without validation of function, and their results are placed in databases. • GO databases record annotation origin by publication • GO databases record evidence codes, e.g.: EXP=Inferred from Experiment; ISS=Inferred from Sequence or Structural Similarity • 0.14% of contributing publications provide 25% of all experimentally validated annotations in the UniProt-GOA compilation. • There are biases in functional annotation. There is no clear solution to this kind of bias, but we have to recognise and account for it.
  • 163. Are database annotations reliable? [Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340] The Critical Assessment of Function Annotation (CAFA) project.
  • 164. Do biased database annotations matter? Experimental annotations of proteins are incomplete. But is that important? Tested by simulation, and by following databases for three years [56]: 1. Yes. It matters. 2. Current large scale annotations are meaningful and almost surprisingly reliable. 3. The nature and level of data incompleteness, and the type of classification model, have an effect. 4. “Low precision, high recall” (i.e. less discriminating) tools are most significantly affected. Molecular function prediction is usually more reliable than biological process prediction [57]. [56] Jiang et al. (2014) Bioinformatics 30:i609-i616 doi:10.1093/bioinformatics/btu472 [57] Cozzetto et al. (2013) BMC Bioinf. 14:S3-S1 doi:10.1186/1471-2105-14-S3-S1
  • 165. CAFA results [Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340] The Critical Assessment of Function Annotation (CAFA) 2013 results. (F-measure combines precision and recall) • You can do better than BLAST. • Best-performing methods do comparably well. • Best methods used evolutionary relationships, structure, and expression data. • Machine Learning works best.
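For reference, the F-measure mentioned above is the harmonic mean of precision and recall. CAFA-style assessments typically summarise a method by the maximum F-measure over its decision thresholds. The sketch below shows both calculations; the precision/recall pairs are hypothetical and are not CAFA data.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for one predictor at three score thresholds
operating_points = [(0.9, 0.2), (0.6, 0.5), (0.3, 0.8)]

# Summarise the predictor by its best F-measure over all thresholds
f_max = max(f_measure(p, r) for p, r in operating_points)
print(f"F-max = {f_max:.3f}")
```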
  • 168. A wee trip to the doctor • You go for a checkup, and are tested for disease X • The test has sensitivity = 0.95 (predicts disease where there is disease) • The test has FPR = 0.01 (predicts disease where there is no disease) • Your test is positive • What is the probability that you have disease X? • 0.01, 0.05, 0.50, 0.95, 0.99? • (Audience Participation!)
  • 170. A wee trip to the doctor • What is the probability that you have disease X? • Unless you know the baseline occurrence of disease X, you cannot determine this. • Baseline occurrence: $f_X$ • $f_X = 0.01 \Rightarrow P(\mathrm{disease} \mid +\mathrm{ve}) = 0.490 \approx 0.5$ • $f_X = 0.8 \Rightarrow P(\mathrm{disease} \mid +\mathrm{ve}) = 0.997 \approx 1.0$
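The two probabilities on this slide follow from Bayes’ Theorem applied to the test’s sensitivity (0.95), its FPR (0.01) and the baseline occurrence. A minimal sketch of the calculation, in which the function name `posterior` is an illustrative choice:

```python
def posterior(sensitivity: float, fpr: float, base_rate: float) -> float:
    """P(condition | positive test) by Bayes' Theorem."""
    true_positive_mass = sensitivity * base_rate      # P(+|X) P(X)
    false_positive_mass = fpr * (1.0 - base_rate)     # P(+|not X) P(not X)
    return true_positive_mass / (true_positive_mass + false_positive_mass)


# Same test, two different baseline occurrences of disease X
for base_rate in (0.01, 0.8):
    p = posterior(sensitivity=0.95, fpr=0.01, base_rate=base_rate)
    print(f"baseline {base_rate}: P(disease | +ve) = {p:.3f}")
```

The same test gives very different answers depending on the baseline, which is why the question cannot be answered from sensitivity and FPR alone.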
  • 173. Why Performance Metrics Matter [Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4] • Imagine a paper describing a predictor for protein functional class (e.g. Type III effector) • The paper reports sensitivity = 0.95, FPR = 0.01 • You run the predictor on 4,500 proteins in a new genome • It predicts 50 members of the class. How many of them are likely to be true positives? • We need a baseline level of that class ($f_X$) in the genome to determine this. • We estimate ≈ 45 members in the protein complement, so $f_X = 0.01$ • $f_X = 0.01 \Rightarrow P(\mathrm{class} \mid +\mathrm{ve}) = 0.490 \approx 0.5$
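It can help to restate this example as expected counts rather than probabilities. The sketch below uses only the numbers on the slide (4,500 proteins, a baseline of 0.01, sensitivity 0.95, FPR 0.01). The variable names are illustrative.

```python
n_proteins = 4500
base_rate = 0.01      # f_X: estimated fraction of proteins in the class
sensitivity = 0.95
fpr = 0.01

n_in_class = n_proteins * base_rate          # about 45 true class members
n_not_in_class = n_proteins - n_in_class     # about 4455 non-members

expected_tp = sensitivity * n_in_class       # members correctly predicted
expected_fp = fpr * n_not_in_class           # non-members wrongly predicted

# Fraction of positive predictions expected to be correct
ppv = expected_tp / (expected_tp + expected_fp)

# Applied to the 50 predictions reported for the new genome
expected_true_among_50 = 50 * ppv

print(f"expected TP = {expected_tp:.2f}, expected FP = {expected_fp:.2f}")
print(f"P(class | +ve) = {ppv:.3f}")
print(f"expected true positives among 50 predictions = {expected_true_among_50:.1f}")
```

False positives are as numerous as true positives here because the 1% FPR is applied to a pool of non-members roughly 99 times larger than the class itself.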
  • 174. Bayes’ Theorem • May seem counter-intuitive: 95% sensitivity, 99% specificity ⇒ 50% chance of any prediction being incorrect (at a base rate of 0.01) • Probability given by Bayes’ Theorem • $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$
  • 175. Let’s play a game. . . 2, 4, 6, . . .
  • 176. Bayes’ Theorem • May seem counter-intuitive: 95% sensitivity, 99% specificity ⇒ 50% chance of any prediction being incorrect (at a base rate of 0.01) • Probability given by Bayes’ Theorem • $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$ • This step is commonly overlooked in the literature, because of: • confirmation bias • people want to see positive examples/tell a story • people want to think their predictor works
  • 178. A cautionary tale [Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376] • Paper describes EffectiveT3, a type III effector prediction tool • Reported sensitivity ≈ 0.71, FPR ≈ 0.15 • Applied tool to 739 complete bacterial and archaeal genomes • Organisms with an identifiable T3SS: 2-7% of genome predicted to be secreted • Organisms without an identifiable T3SS (or known not to have one): 1-10% of genome predicted to be secreted • “The surprisingly high number of (false) positives in genomes without T3SS exceeds the expected false positive rate” • This is not a surprise, statistically.
  • 179. A cautionary tale [Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376] The probability that an EffectiveT3 positive prediction corresponds to a secreted protein is given by Bayes’ Theorem: • $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$ • $P(+ \mid X)$ = sensitivity = 0.71 • $P(+ \mid \bar{X})$ = FPR = 0.15 • $P(X)$ = base rate ≈ 0.03 [58] • ⇒ $P(X \mid +) \approx 0.13$ Only 13% of positive predictions are likely to be true positives! How many predicted type III secreted proteins were there... [58] Boch and Bonas (2010) Annu. Rev. Phytopathol. 48:419-436 doi:10.1146/annurev-phyto-080508-081936
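Substituting the slide’s values into Bayes’ Theorem makes the arithmetic explicit. The only quantity not quoted directly on the slide is $P(\bar{X}) = 1 - P(X) = 0.97$.

```latex
\[
P(X \mid +)
  = \frac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}
  = \frac{0.71 \times 0.03}{0.71 \times 0.03 + 0.15 \times 0.97}
  = \frac{0.0213}{0.0213 + 0.1455}
  = \frac{0.0213}{0.1668}
  \approx 0.13
\]
```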
  • 182. Interpreting genome-scale predictions [Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4] • Statistics at genome-scale can be counterintuitive. • Use Bayes’ Theorem! • Predictions identify groups, not individual members of the group. e.g. • Test for airport smugglers has P(smuggler|+) = 0.9 • Test gives 100 positives • Which specific individuals are truly smugglers? • The test does not allow you to determine this - you need more evidence for each individual • The same principle applies to other classifiers (including protein functional class prediction) - watch for ‘cherry-picking’ in publications
  • 184. Reconstructing metabolism [Thiele and Palsson (2010) Nat. Protoc. 5:93-121 doi:10.1038/nprot.2009.203] Once metabolic functional annotation has been assigned to features, we can do comparative analysis of metabolism.
  • 185. Dynamic models of metabolism [Orth et al. (2010) Nat. Biotech. 28:245-248 doi:10.1038/nbt.1614] By using constraint-based models (e.g. Flux Balance Analysis), we can make these into dynamic representations of bacterial metabolism. • Upper, lower bounds to reaction rates • Define objective phenotype • Calculate conditions resulting in flux • in silico knockouts
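To make the bullet points concrete, here is a deliberately tiny, hypothetical network: one uptake reaction produces metabolite A, two alternative reactions (R1, R2) convert A to B, and a biomass reaction consumes B. At steady state the flux into each metabolite equals the flux out, so maximising the biomass flux subject to upper bounds can be done by inspection. This is a toy sketch only. Real Flux Balance Analysis solves the same kind of problem for genome-scale networks with a linear programming solver, and every reaction name and bound here is invented.

```python
def max_biomass_flux(upper_bounds: dict) -> float:
    """Optimal biomass flux for the toy network uptake -> A -> (R1 | R2) -> B -> biomass.

    Steady state requires v_uptake == v_R1 + v_R2 == v_biomass, with all fluxes
    between 0 and their upper bound, so the optimum is the tightest bottleneck.
    """
    supply = upper_bounds["uptake"]
    conversion = upper_bounds["R1"] + upper_bounds["R2"]
    demand = upper_bounds["biomass"]
    return min(supply, conversion, demand)


def knockout(upper_bounds: dict, reaction: str) -> dict:
    """In silico knockout: force the flux through one reaction to zero."""
    mutant = dict(upper_bounds)
    mutant[reaction] = 0.0
    return mutant


# Hypothetical upper bounds on reaction rates (arbitrary units); lower bounds are 0
bounds = {"uptake": 10.0, "R1": 6.0, "R2": 3.0, "biomass": 100.0}

wild_type = max_biomass_flux(bounds)
print(f"wild type: {wild_type}")
for reaction in ("R1", "R2", "uptake"):
    print(f"knockout {reaction}: {max_biomass_flux(knockout(bounds, reaction))}")
```

Knocking out R1 or R2 reduces the predicted growth flux without abolishing it, because the other route remains. Knocking out uptake is predicted to be lethal.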
  • 186. E. coli metabolism [Monk et al. (2013) Proc. Natl. Acad. Sci. USA 110:20338-20343 doi:10.1073/pnas.1307797110] E. coli has a very long history of metabolic reconstruction [59]. Recent modelling work predicts which nutrients support growth. [59] Reed and Palsson (2003) J. Bact. 185:2692-2699 doi:10.1128/JB.185.9.2692-2699.2003
  • 187. E. coli metabolism [Baumler et al. (2011) BMC Syst. Biol. 5:182 doi:10.1186/1752-0509-5-182] Models are complex, and experimental validation is essential. There’s more we don’t know...
  • 188. Microbial Genomics and Bioinformatics BM405 5. Finding Equivalent Features Leighton Pritchard (1,2,3): (1) Information and Computational Sciences, (2) Centre for Human and Animal Pathogens in the Environment, (3) Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
  • 191. What makes genome features equivalent? When we compare two features (e.g. genes) between two or more genomes, there must be some basis for making the comparison. That is, they have to be equivalent in some way, such as: • common evolutionary origin • functional similarity • a family-based relationship It’s common to define equivalence of genome features in terms of evolutionary relationship.
  • 192. Why look at equivalent features? The real power of genomics is comparative genomics! • Makes catalogues of genome components comparable between organisms • Differences, e.g. presence/absence of equivalents may support hypotheses for functional or phenotypic difference • Can identify characteristic signals for diagnosis/epidemiology • Can build parts lists and wiring diagrams for systems and synthetic biology
  • 193. Evolutionary relationships [Fitch (1970) Syst. Zool. 19:99-113 doi:10.2307/2412448] Equivalencies and relationships can be quite complex. We need precise terms to describe relationships between genome features. • analogy: functional similarity • homology: evolutionary common ancestor
  • 195. Who let the -logues out? [Fitch (2000) Trends Genet. 16:227-231 doi:10.1016/S0168-9525(00)02005-9] • homologues: elements that are similar because they share a common ancestor. There are NOT degrees of homology • analogues: elements that are (functionally?) similar, and this may be through common ancestry or some other means, e.g. convergent evolution • orthologues: homologues that diverged through speciation • paralogues: homologues that diverged through duplication within the same genome
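One way to make the orthologue/paralogue distinction operational is to label each internal node of a gene tree as a speciation or a duplication event. Two homologues are then orthologues if their most recent common ancestor node is a speciation, and paralogues if it is a duplication. The sketch below assumes such a labelled tree is already available, which in practice is itself an inference. The `Node` class, the tree and the gene names are hypothetical.

```python
class Node:
    """Gene tree node: internal nodes carry an event label, leaves carry a gene name."""

    def __init__(self, event=None, children=(), gene=None):
        self.event = event            # "speciation" or "duplication" (internal nodes)
        self.children = list(children)
        self.gene = gene              # gene identifier (leaves only)


def leaves(node):
    """Set of gene names at or below this node."""
    if node.gene is not None:
        return {node.gene}
    found = set()
    for child in node.children:
        found |= leaves(child)
    return found


def relationship(root, gene_a, gene_b):
    """Classify two distinct genes by the event at their most recent common ancestor."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaves(root):
        raise ValueError("need two distinct genes that are both in the tree")
    node = root
    while True:
        for child in node.children:
            below = leaves(child)
            if gene_a in below and gene_b in below:
                node = child          # both genes sit deeper; keep descending
                break
        else:
            # No single child contains both genes: this node is their common ancestor
            return "orthologues" if node.event == "speciation" else "paralogues"


# An ancestral duplication gives copies "a" and "b"; each copy then splits by
# speciation into species 1 and species 2.
tree = Node("duplication", [
    Node("speciation", [Node(gene="a_sp1"), Node(gene="a_sp2")]),
    Node("speciation", [Node(gene="b_sp1"), Node(gene="b_sp2")]),
])

print(relationship(tree, "a_sp1", "a_sp2"))   # diverged at a speciation node
print(relationship(tree, "a_sp1", "b_sp1"))   # diverged at the duplication node
print(relationship(tree, "a_sp1", "b_sp2"))   # also diverged at the duplication node
```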
  • 201. ITYFIALMCTT [Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030] But it’s a little more complicated than that. Biology is not well-behaved. • Gene loss • Homologues may diverge so widely that they can be hard to recognise • Reconstructed evolutionary trees may not be robust inferences of speciation (or relevant to it, in prokaryotes) • There is no record of history - we can only make inferences All classifications of orthology/paralogy are inferences!
  • 203. Ensembl Compara [Vilella et al. (2009) Genome Res. 19:327-335 doi:10.1101/gr.073585.107] Some tools/databases, e.g. Ensembl Compara, use slightly different definitions (almost everything’s an “orthologue”)
  • 205. Why focus on orthologues? Orthologues formalise the idea of corresponding genes in different organisms. Orthologues serve two purposes: • Evolutionary equivalence • Functional equivalence (“The Ortholog Conjecture” [60]) Applications in comparative genomics, functional genomics and phylogenetics [61]. Over 30 databases attempt to describe orthologous relationships (http://questfororthologs.org/orthology_databases [62]) [60] Chen and Zhang (2012) PLoS Comp. Biol. 8:e1002784 doi:10.1371/journal.pcbi.1002784 [61] Dessimoz (2011) Brief. Bioinf. 12:375-376 doi:10.1093/bib/bbr057 [62] Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
  • 206. Finding orthologues Multiple methods and databases [63, 64, 65] • Pairwise genome: RBBH (aka BBH, RBH), RSD, InParanoid, RoundUp • Multi-genome, graph-based: COG, eggNOG, OrthoDB, OrthoMCL, OMA, MultiParanoid • Multi-genome, tree-based: TreeFam, Ensembl Compara, PhylomeDB, LOFT [63] Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030 [64] Trachana et al. (2011) Bioessays 33:769-780 doi:10.1002/bies.201100062 [65] Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
  • 208. Which prediction methods work best? Taking advantage of prokaryotic operon structure: if the outer pair of a syntenic triplet of genes are orthologous, the middle gene is also likely to be orthologous [66]. Specifically testing reciprocal best hits (RBH). [66] Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100
  • 209. Which prediction methods work best? • Tested on 573 prokaryotic genomes • 88-99% of RBH found in syntenic triplets • Overwhelming majority of middle genes are RBH RBH reliably finds orthologues [67]. [67] Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100
  • 210. Which prediction methods work best? Four methods tested against 2,723 curated orthologues from six Saccharomycetes • RBBH (and cRBH); RSD (and cRSD); MultiParanoid; OrthoMCL • Rated by statistical performance metrics: sensitivity, specificity, accuracy, FDR cRBH was the most accurate and specific, with the lowest FDR [68]. [68] Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
  • 211. Which prediction methods work best? Testing on literature-based benchmarks for grouping by function and correct branching of phylogeny [69]. [69] Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
  • 212. Which prediction methods work best? • Performance varies by choice of method, and interpretation of “orthology” • Biggest influence is genome annotation quality • Relative performance varies with choice of benchmark • (clustering) RBH outperforms more complex algorithms under many circumstances
  • 213. What is this magic RBH method?
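In outline, a reciprocal best hit (RBH) pairs gene a from genome A with gene b from genome B when b is a’s best-scoring match in genome B and a is b’s best-scoring match in genome A. The sketch below assumes that pairwise scores (for example, BLAST bit scores) have already been computed in both directions. The gene names and scores are hypothetical.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Pairs (a, b) where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for genome A queries against genome B, and the reverse search
a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 120, ("a2", "b2"): 300, ("a3", "b2"): 310}
b_vs_a = {("b1", "a1"): 495, ("b2", "a3"): 305, ("b2", "a2"): 290}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

Here a2’s best hit is b2, but b2’s best hit is a3, so a2 has no reciprocal partner.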
  • 215. Functional adaptation in Pba [a]. [a] Toth et al. (2006) Ann. Rev. Phytopath. 44:305-336 doi:10.1146/annurev.phyto.44.070505.143444
  • 218. Core genome Once equivalent genes have been identified, those present in all related isolates define the core genome. The core genome is expected to underpin common function. [Figure: a core RBH cluster (clique) for 29 genomes]
  • 219. Accessory genome The remaining genes are the accessory genome, and are expected to mediate function that distinguishes between isolates. [Figure: an accessory RBH cluster for 29 genomes]
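The core/accessory split can be sketched as a set operation. This assumes each cluster of equivalent genes has been reduced to the set of genomes contributing a member to it; the cluster and genome identifiers are hypothetical:

```python
def partition_clusters(clusters, genomes):
    """Split cluster IDs into core (every genome represented) and accessory.

    clusters: dict mapping cluster ID -> set of genomes with a member gene
    genomes:  iterable of all genomes in the comparison
    """
    all_genomes = set(genomes)
    core = {cid for cid, present in clusters.items() if all_genomes <= present}
    accessory = set(clusters) - core
    return core, accessory


# Hypothetical clusters across three genomes
genomes = ["g1", "g2", "g3"]
clusters = {
    "c1": {"g1", "g2", "g3"},  # present in every genome
    "c2": {"g1", "g2"},        # missing from g3
    "c3": {"g3"},              # unique to g3
}

core, accessory = partition_clusters(clusters, genomes)
print(sorted(core), sorted(accessory))
```

Note that this sketch counts presence only; it does not check that a core cluster is a clique with exactly one member per genome.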
  • 220. Accessory clusters Accessory RBH clusters can be pruned to identify the accessory genome specific to subgroups of isolates. These genes may be responsible for subgroup-specific phenotypes.
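One possible reading of subgroup-specific is: present in every member of the subgroup and absent from every other isolate. A minimal sketch under that assumption (it is not necessarily the pruning procedure used for the figure), again representing each cluster by the set of genomes that contribute a gene; all identifiers are hypothetical:

```python
def subgroup_specific(clusters, subgroup):
    """Return IDs of clusters found in all subgroup genomes and in no others.

    clusters: dict mapping cluster ID -> set of genomes with a member gene
    subgroup: iterable of genomes making up the subgroup of interest
    """
    members = set(subgroup)
    return sorted(cid for cid, present in clusters.items() if present == members)


# Hypothetical accessory clusters across four genomes
accessory_clusters = {
    "c2": {"g1", "g2"},
    "c3": {"g3"},
    "c4": {"g1", "g2", "g4"},
    "c5": {"g1", "g2"},
}

specific = subgroup_specific(accessory_clusters, ["g1", "g2"])
print(specific)  # ['c2', 'c5']
```

In practice a looser criterion (for example, tolerating a missing gene call in a draft genome) may be more appropriate, since annotation quality strongly affects cluster membership.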
  • 221. Accessory genome Accessory genomes act as a cradle for adaptive evolution [70]. This is particularly so for pathogens, such as Pseudomonas spp. [71] [70] Croll and McDonald (2012) PLoS Path. 8:e1002608 doi:10.1371/journal.ppat.1002608 [71] Baltrus et al. (2011) PLoS Path. 7:e1002132 doi:10.1371/journal.ppat.1002132.t002
  • 222. Core genome synteny Using tools like i-ADHoRe [72] that identify synteny and collinearity, the structural organisation of the core genome can be determined. For Dickeya, the core genome appears to be structurally well-conserved across all isolates. [72] Proost et al. (2012) Nuc. Acids Res. 40:e11 doi:10.1093/nar/gkr955
  • 223. Panseq [a] Panseq is an online tool for identification of core and accessory genomes, available at https://lfz.corefacility.ca/panseq/, and at https://github.com/chadlaing/Panseq for standalone use. [a] Laing et al. (2010) BMC Bioinf. 11:461 doi:10.1186/1471-2105-11-461
  • 224. Harvest [a] Visualising and organising comparison/pangenome data across thousands of bacteria is difficult. The Harvest suite of tools enables alignment and visualisation of thousands of genomes. [a] Treangen et al. (2014) Genome Biol. 15:524 doi:10.1186/s13059-014-0524-x
  • 226. Things I didn’t get to
  • 231. Licence: CC-BY-SA. By: Leighton Pritchard. This presentation is licensed under the Creative Commons Attribution ShareAlike license: https://creativecommons.org/licenses/by-sa/4.0/