Phylogeny driven approaches
to the study of microbial diversity
September 3, 2015
Queenstown Computational Genomics
Conference
Jonathan A. Eisen
@phylogenomics
University of California, Davis
[Chart: PubMed “Microbiome” hits per year, 2000 to 2013 (y-axis 0 to 4000)]
The Rise of the Microbiome
microBIOME or microbiOME
• microbi-OME
• collection of genomes of microbes from a
community (emphasis on OME)
• micro-BIOME
• a community of microbes (emphasis on
BIOME)
• see http://tinyurl.com/definemicrobiome
Not Just About Humans or Hosts
Why Now?
Why Now I: Appreciation of Microbial Diversity
Functional Diversity
Diversity of Form
Phylogenetic Diversity
MICROBES
RUN THE
PLANET
Why Now II: Post Genome Blues
The Microbiome
Transcriptome
Variome
Epigenome
Overselling the Human Genome?
<<<<
Culturing Observation
CountCount
DNA
Why Now III: CSI-Microbiology Advances
Why Now IV: Sequencing Has Gone Crazy
Sequencing Revolution
•More genes and genomes
•Deeper sequencing
• The rare biosphere
• Relative abundance estimates
•More samples (with barcoding)
• Time series
• Spatially diverse sampling
• Fine scale sampling
Turnbaugh et al Nature. 2006 444(7122):1027-31.
Why Now V: Microbiome Functions
Uses of Phylogeny 1: Species Phylogeny
Woese: Classification of Cultured Taxa by rRNA
rRNA rRNA rRNA
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
Eukaryotes Bacteria ????? Archaebacteria Archaea
Isolate Ribosomes
Archaea
Woese: Classification of Cultured Taxa by rRNA PCR
rRNA
rRNA
PCR
rRNA
PCR
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
EukaryotesBacteria
Isolate DNA
Archaea
rRNA
rRNA
PCR
rRNA
PCR
EukaryotesBacteria
Isolate DNA
ACTGC
ACCTAT
CGTTCG
ACTGC
ACCTAT
CGTTCG
ACTGC
ACCTAT
CGTTCG
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACTGCACCTATCGTTCG
Phylotyping via rRNA PCR: One Taxon
Chemosymbiont rRNA Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416
Colleen Cavanaugh
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 ACTGCACCTATCGTTCG
Archaea EukaryotesBacteria
ACTGC
ACCTAT
CGTTCG
ACTGC
ACCTAT
CGTTCG
ACCCC
AGCTCT
CGCTCG
rRNA
rRNA
PCR
rRNA
PCR
Isolate DNA
Phylotyping via rRNA PCR: Two Taxa
ACTGC
ACCTAT
CGTTCG
ACTCC
AGCTAT
CGATCG
ACCCC
AGCTCT
CGCTCG
AGGGG
AGCTCT
CGCTCG
AGGGG
AGCTCT
CGCTCG
ACTGC
ACCTAT
CGTTCG
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 ACTGCACCTATCGTTCG
New3 ACCCCAGCTCTCGCTCG

New4 AGGGGAGCTCTCGCTCG
Archaea EukaryotesBacteria
rRNA
rRNA
PCR
rRNA
PCR
Isolate DNA
Phylotyping via rRNA PCR: Four Taxa
Similarity vs. Phylogeny
Approaching to NGS
Discovery of DNA structure
(Cold Spring Harb. Symp. Quant. Biol. 1953;18:123-31)
1953
Sanger sequencing method by F. Sanger
(PNAS ,1977, 74: 560-564)
1977
PCR by K. Mullis
(Cold Spring Harb Symp Quant Biol. 1986;51 Pt 1:263-73)
1983
Development of pyrosequencing
(Anal. Biochem., 1993, 208: 171-175; Science ,1998, 281: 363-365)
1993
1980
1990
2000
2010
Single molecule emulsion PCR 1998
Human Genome Project
(Nature , 2001, 409: 860–92; Science, 2001, 291: 1304–1351)
Founded 454 Life Science 2000
454 GS20 sequencer
(First NGS sequencer)
2005
Founded Solexa 1998
Solexa Genome Analyzer
(First short-read NGS sequencer)
2006
GS FLX sequencer
(NGS with 400-500 bp read length)
2008
Hi-Seq2000
(200Gbp per Flow Cell)
2010
Illumina acquires Solexa
(Illumina enters the NGS business)
2006
ABI SOLiD
(Short-read sequencer based upon ligation)
2007
Roche acquires 454 Life Sciences
(Roche enters the NGS business)
2007
NGS Human Genome sequencing
(First Human Genome sequencing based upon NGS technology)
2008
From Slideshare presentation of Cosentino Cristian
http://www.slideshare.net/cosentia/high-throughput-equencing
Miseq
Roche Jr
Ion Torrent
PacBio
Oxford
Automation is Critical
AAATCGCTAGCGC
CGGCGAGCTAGC
CGAGCGATCGAGC
CGAGCATCGAGTA
STAP (for rRNA)
An Automated Phylogenetic Tree-Based Small Subunit
rRNA Taxonomy and Alignment Pipeline (STAP)
Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3
1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences,
University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of
California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America,
5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United
States of America
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know
about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline
and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of
data has opened many new windows into microbial diversity and evolution, and at the same time has created significant
methodological challenges. Those processes which commonly require time-consuming human intervention, such as the
preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated
methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though
computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple
sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-
automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments
and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic
assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages
(PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly,
this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that
are unattainable by manual efforts.
Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS
ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
multiple alignment and phylogeny was deemed unfeasible.
However, we believe this can compromise the value of the results.
For example, the delineation of OTUs has also been automated
via tools that do not make use of alignments or phylogenetic trees
(e.g., Greengenes). This is usually done by carrying out pairwise
comparisons of sequences and then clustering sequences that
have better than some cutoff threshold of similarity with each
other. This approach can be powerful (and reasonably efficient)
but it too has limitations. In particular, since multiple sequence
alignments are not used, one cannot carry out standard
phylogenetic analyses. In addition, without multiple sequence
alignments one might end up comparing and contrasting different
regions of a sequence depending on what it is paired with.
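The pairwise-comparison-and-cutoff style of OTU delineation described above can be illustrated with a minimal sketch. This is not the method of Greengenes or any other named tool; it is a hypothetical greedy clustering in which `percent_identity`, `greedy_otus`, and the 0.9 cutoff are illustrative names and values. To stay self-contained it assumes equal-length toy sequences (taken from the slides' character tables), whereas real tools would first compute a pairwise alignment for each comparison.

```python
def percent_identity(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, cutoff=0.97):
    """Greedy clustering: each sequence joins the first OTU whose seed
    sequence it matches at or above the cutoff, else it seeds a new OTU.

    seqs   -- dict mapping sequence name to sequence string
    cutoff -- identity threshold (an arbitrary, user-chosen value)
    """
    otus = []  # list of (seed_sequence, [member names])
    for name, seq in seqs.items():
        for seed, members in otus:
            if percent_identity(seq, seed) >= cutoff:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


# Toy sequences copied from the "Taxa / Characters" tables in the slides.
TOY = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}

print(greedy_otus(TOY, cutoff=0.9))
print(len(greedy_otus(TOY, cutoff=1.0)))
```

Changing the cutoff changes the number of OTUs, and the result of a greedy scheme like this can also depend on input order, which is one reason the choice of threshold matters.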
The limitations of avoiding multiple sequence alignments and
phylogenetic analysis are readily apparent in tools to classify
sequences. For example, the Ribosomal Database Project’s
Classifier program [29] focuses on composition characteristics of
each sequence (e.g., oligonucleotide frequency) and assigns
taxonomy based upon clustering genes by their composition.
Though this is fast and completely automatable, it can be misled in
cases where distantly related sequences have converged on similar
composition, something known to be a major problem in ss-rRNA
sequences [30]. Other taxonomy assignment systems focus
primarily on the similarity of sequences. The simplest of these is
classification tools it does have some limitations. For example,
the generation of new alignments for each sequence is
computationally costly and does not take advantage of available
curated alignments that make use of ss-rRNA secondary structure
to guide the primary sequence alignment. Perhaps most
importantly, however, the tool is not fully automated. In
addition, it does not generate multiple sequence alignments for all
sequences in a dataset which would be necessary for doing many
analyses.
Automated methods for analyzing rRNA sequences are also
available at the web sites for multiple rRNA centric databases,
such as Greengenes and the Ribosomal Database Project (RDPII).
Though these and other web sites offer diverse powerful tools, they
do have some limitations. For example, not all provide multiple
sequence alignments as output and few use phylogenetic
approaches for taxonomy assignments or other analyses. More
importantly, all provide only web-based interfaces and their
integrated software, (e.g., alignment and taxonomy assignment),
cannot be locally installed by the user. Therefore, the user cannot
take advantage of the speed and computing power of parallel
processing such as is available on linux clusters, or locally alter and
potentially tailor these programs to their individual computing
needs (Table 1).
Given the limited automated tools that are available for
Table 1. Comparison of STAP’s computational abilities relative to existing commonly-used ss-rRNA analysis tools.
                                          STAP          ARB      Greengenes  RDP
Installed where?                          Locally       Locally  Web only    Web only
User interface                            Command line  GUI      Web portal  Web portal
Parallel processing                       YES           NO       NO          NO
Manual curation for taxonomy assignment   NO            YES      NO          NO
Manual curation for alignment             NO            YES      NO*         NO
Open source                               YES**         NO       NO          NO
Processing speed                          Fast          Slow     Medium      Medium
It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is
more amenable to downstream code manipulation.
* Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
** The STAP program itself is open source; the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001
ss-rRNA Taxonomy Pipeline
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
STAP database, and the query sequence is aligned to them using
the CLUSTALW profile alignment algorithm [40] as described
above for domain assignment. By adapting the profile alignment
algorithm, the alignments from the STAP database remain intact,
while gaps are inserted and nucleotides are trimmed for the query
sequence according to the profile defined by the previous
alignments from the databases. Thus the accuracy and quality of
the alignment generated at this step depends heavily on the quality
of the Bacterial/Archaeal ss-rRNA alignments from the
Greengenes project or the Eukaryotic ss-rRNA alignments from
the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on
the assumption that the residues (nucleotides or amino acids) at the
same position in every sequence in the alignment are homologous.
Thus, columns in the alignment for which ‘‘positional homology’’
cannot be robustly determined must be excluded from subsequent
analyses. This process of evaluating homology and eliminating
questionable columns, known as masking, typically requires time-
consuming, skillful, human intervention. We designed an automat-
ed masking method for ss-rRNA alignments, thus eliminating this
bottleneck in high-throughput processing.
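The idea of masking, scoring each alignment column and dropping the questionable ones, can be sketched as follows. This is not STAP's scoring scheme (which the text describes as similar to the CLUSTALX column score); it swaps in a much simpler, hypothetical rule based on gap fraction and majority-character frequency. The function names and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def column_mask(alignment, min_conservation=0.5, max_gap_fraction=0.5):
    """Return one boolean per column: True means the column is kept.

    alignment        -- list of equal-length aligned sequences, '-' for gaps
    min_conservation -- minimum frequency of the most common residue
                        among non-gap characters (illustrative value)
    max_gap_fraction -- maximum allowed fraction of gaps (illustrative value)
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residues = [c for c in column if c != "-"]
        gap_fraction = (len(column) - len(residues)) / len(column)
        if not residues or gap_fraction > max_gap_fraction:
            keep.append(False)  # column is mostly gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        keep.append(top_count / len(residues) >= min_conservation)
    return keep


def apply_mask(alignment, keep):
    """Drop every masked column from every sequence."""
    return ["".join(c for c, k in zip(seq, keep) if k) for seq in alignment]


toy_alignment = ["AC-GT", "ACAGA", "AC-GC", "AT-GG"]
mask = column_mask(toy_alignment)
print(mask)
print(apply_mask(toy_alignment, mask))
```

In the toy alignment the third column is dropped for being mostly gaps and the last for having no majority residue; a production mask would use a substitution-matrix-aware score rather than simple frequencies.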
First, an alignment score is calculated for each aligned column
by a method similar to that used in the CLUSTALX package [42].
Specifically, an R-dimensional sequence space representing all the
possible nucleotide character states is defined. Then for each
aligned column, the nucleotide populating that column in each of
the aligned sequences is assigned a score in each of the R
dimensions (Sr) according to the IUB matrix [42]. The consensus
‘‘nucleotide’’ for each column (X) also has R dimensions, with the
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to
each query sequence based on its position in a maximum likelihood
tree of representative ss-rRNA sequences. Because the tree illustrated
here is not rooted, domain assignment would not be accurate and
Dongying Wu
Amber Hartman
Naomi Ward
alignment used to build the profile, resulting in a multiple PD versus PID clustering, 2) to explore overlap between PhylOT
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generaliz
workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
Finding Metagenomic OTU
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard
KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity
and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061.
doi:10.1371/journal.pcbi.1001061
PhylOTU
Tom Sharpton
Katie Pollard
Jessica Green
rRNA PCR: Community Comparisons
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 ACTGCACCTATCGTTCG
New3 ACCCCAGCTCTCGCTCG

New4 AGGGGAGCTCTCGCTCG
Archaea EukaryotesBacteria
rRNA
rRNA
PCR
rRNA
PCR
Isolate DNA
Hartman et al. BMC Bioinformatics 2010, 11:317
http://www.biomedcentral.com/1471-2105/11/317
Software (Open Access)
Introducing W.A.T.E.R.S.: a Workflow for the
Alignment, Taxonomy, and Ecology of Ribosomal
Sequences
Amber L Hartman†1,3, Sean Riddle†2, Timothy McPhillips2, Bertram Ludäscher2 and Jonathan A Eisen*1
Abstract
Background: For more than two decades microbiologists have used a highly conserved microbial gene as a
phylogenetic marker for bacteria and archaea. The small-subunit ribosomal RNA gene, also known as 16S rRNA, is
encoded by ribosomal DNA, 16S rDNA, and has provided a powerful comparative tool to microbial ecologists. Over
time, the microbial ecology field has matured from small-scale studies in a select number of environments to massive
collections of sequence data that are paired with dozens of corresponding collection variables. As the complexity of
data and tool sets has grown, the need for flexible automation and maintenance of the core processes of 16S rDNA
sequence analysis has increased correspondingly.
Results: We present WATERS, an integrated approach for 16S rDNA analysis that bundles a suite of publicly available
16S rDNA analysis software tools into a single software package. The "toolkit" includes sequence alignment, chimera
removal, OTU determination, taxonomy assignment, phylogenetic tree construction as well as a host of ecological
analysis and visualization tools. WATERS employs a flexible, collection-oriented 'workflow' approach using the
open-source Kepler system as a platform.
Conclusions: By packaging available software tools into a single automated workflow, WATERS simplifies 16S rDNA
analyses, especially for those without specialized bioinformatics or programming expertise. In addition, WATERS, like
some of the newer comprehensive rRNA analysis tools, allows researchers to minimize the time dedicated to carrying
out tedious informatics steps and to focus their attention instead on the biological interpretation of the results. One
advantage of WATERS over other comprehensive tools is that the use of the Kepler workflow system facilitates result
interpretation and reproducibility via a data provenance sub-system. Furthermore, new "actors" can be added to the
workflow as desired and we see WATERS as an initial seed for a sizeable and growing repository of interoperable,
easy-to-combine tools for asking increasingly complex microbial ecology questions.
Background
Microbial communities and how they are surveyed
Microbial communities abound in nature and are crucial
for the success and diversity of ecosystems. There is no
end in sight to the number of biological questions that
can be asked about microbial diversity on earth. From
animal and human guts to open ocean surfaces and deep
sea hydrothermal vents, to anaerobic mud swamps or
boiling thermal pools, to the tops of the rainforest canopy
and the frozen Antarctic tundra, the composition of
microbial communities is a source of natural history,
intellectual curiosity, and reservoir of environmental
health [1]. Microbial communities are also mediators of
insight into global warming processes [2,3], agricultural
success [4], pathogenicity [5,6], and even human obesity
[7,8].
In the mid-1980s, researchers began to sequence ribosomal
RNAs from environmental samples in order to characterize
the types of microbes present in those samples (e.g., [9,10]).
This general approach was revolutionized by the invention of
the polymerase chain reaction (PCR), which made it relatively
easy to clone and then
* Correspondence: jaeisen@ucdavis.edu
1 Department of Medical Microbiology and Immunology and the Department
of Evolution and Ecology, Genome Center, University of California Davis, One
Shields Avenue, Davis, CA, 95616, USA
† Contributed equally
Full list of author information is available at the end of the article
WATERS - Kepler Workflow for rRNA
genes for ribosomal RNA) in partic-
ubunit ribosomal RNA (ss-rRNA).
ed a large amount of previously
l diversity [1,11-13]. Researchers
all subunit rRNA gene not only
ith which it can be PCR amplified,
has variable and highly conserved
to be universally distributed among
nd it is useful for inferring phyloge-
4,15]. Since then, "cultivation-inde-
" have brought a revolution to the
by allowing scientists to study a
mount of diversity in many different
ments [16-18]. The general premise
Figure 1 Overview of WATERS. Schema of WATERS where white
boxes indicate "behind the scenes" analyses that are performed in WA-
[Flowchart labels: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; Unifrac files; Cytoscape network; OTU table]
Motivations
As outlined above, successfully processing microbial
sequence collections is far from trivial. Each step is com-
plex and usually requires significant bioinformatics
expertise and time investment prior to the biological
interpretation. In order to both increase efficiency and
ensure that all best-practice tools are easily usable, we
sought to create an "all-inclusive" method for performing
all of these bioinformatics steps together in one package.
To this end, we have built an automated, user-friendly,
workflow-based system called WATERS: a Workflow for
the Alignment, Taxonomy, and Ecology of Ribosomal
Sequences (Fig. 1). In addition to being automated and
simple to use, because WATERS is executed in the Kepler
scientific workflow system (Fig. 2) it also has the advan-
tage that it keeps track of the data lineage and provenance
of data products [23,24].
Automation
The primary motivation in building WATERS was to
minimize the technical, bioinformatics challenges that
arise when performing DNA sequence clustering, phylogenetic
tree, and statistical analyses by automating the 16S
rDNA analysis workflow. We also hoped to exploit
additional features that workflow-based approaches
entail, such as optimized execution and data lineage
tracking and browsing [23,25-27]. In the earlier days of 16S
rDNA analysis, simply knowing which microbes were
present and whether they were biologically novel was a
noteworthy achievement. It was reasonable and expected,
therefore, to invest a large amount of time and effort to
get to that list of microbes. But now that current efforts
are significantly more advanced and often require comparison
of dozens of factors and variables with datasets of
thousands of sequences, it is not practically feasible to
process these large collections "by hand", and it is hugely
inefficient to do so when automated methods can be
successfully employed.
Broadening the user base
A second motivation and perspective is that by minimizing
the technical difficulty of 16S rDNA analysis through
the use of WATERS, we aim to make the analysis of these
datasets more widely available and allow individuals with
Figure 2 Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input
and output paths where the user declares the location of their input files and desired location for the results files. Each green box is an individual Kepler
actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-
clicking on any actor or connector allows it to be manipulated and re-arranged.
default is 97% and 99%), and they are also generated for
every metadata variable comparison that the user
includes.
Data pruning
To assist in troubleshooting and quality contro
WATERS returns to the user three fasta files of sequenc
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar
to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic
tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing
the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
[Figure 3 panels. (B) PCA plot with axes labeled "Percent variation explained"; points labeled A to F and S. (C) Tree labels: Bacteroidetes, Bacteroidales, Deltaproteobacteria, Actinobacteria, Verrucomicrobia, Epsilonproteobacteria, Firmicutes, Clostridia, Clostridiales, Gammaproteobacteria, Cyanobacteria, Alphaproteobacteria, Fusobacteria, Firmicutes, Bacilli, Firmicutes, Mollicutes]
Amber Hartman
Tree from Woese. 1987.
Microbiological Reviews 51:221
rRNA Not Perfect
Nothing is Perfect
rRNA Phylogeny Copy # Correction
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information
Improves Estimates of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743.
doi:10.1371/journal.pcbi.1002743
Steven Kembel
Jessica Green
Martin Wu
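The basic arithmetic behind a 16S copy-number correction can be shown with a small sketch. It assumes that per-taxon copy numbers are already known or have been estimated by some other means; how those estimates are obtained is the hard part and is not shown here. The function name, the taxon names, and the numbers are hypothetical.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert raw 16S read counts into copy-number-corrected relative abundances.

    read_counts  -- dict: taxon -> number of 16S reads observed
    copy_numbers -- dict: taxon -> estimated 16S gene copies per genome
    Returns a dict: taxon -> estimated fraction of cells in the community.
    """
    corrected = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers.get(taxon)
        if copies is None or copies <= 0:
            raise ValueError(f"no usable copy number for {taxon!r}")
        # A genome with k copies contributes k reads per cell, so divide by k.
        corrected[taxon] = reads / copies
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical community: equal read counts, unequal copy numbers.
reads = {"taxonA": 100, "taxonB": 100}
copies = {"taxonA": 1, "taxonB": 4}
print(correct_for_copy_number(reads, copies))
```

With equal read counts, the taxon carrying four copies per genome is estimated to be a quarter as abundant in cells as the single-copy taxon, so uncorrected read counts would overstate it.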
Tree Complications 1
rRNA rRNA rRNA
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
Euks Bacteria Arch
Isolate Ribosomes
Arch
Tree Complications 2
Tree Complications 3
Automated Accurate Genome Tree
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of
Bacterial and Archaeal Genomes Using Conserved
Genes: Supertrees and Supermatrices. PLoS ONE
8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna
Lang
Aaron
Darling
AMPHORA
Martin
Wu
Metagenomics
metagenomics
ACUGC
ACCUAU
CGUUCG
ACUCC
AGCUAU
CGAUCG
ACCCC
AGCUCU
CGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
R ACUCCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
F ACUCCAGGUAUCGAUCG
C ACCCCAGCUCUCGCUCG
W ACCCCAGCUCUGGCUCG
Taxa Characters
S ACUGCACCUAUCGUUCG
E ACUCCAGCUAUCGAUCG
C ACCCCAGCUCUCGCUCG
EukaryotesBacteria Archaea
inputs of fixed carbon or nitrogen from external sources. As with
Leptospirillum group I, both Leptospirillum group II and III have the
genes needed to fix carbon by means of the Calvin–Benson–
Bassham cycle (using type II ribulose 1,5-bisphosphate carboxy-
lase–oxygenase). All genomes recovered from the AMD system
contain formate hydrogenlyase complexes. These, in combination
with carbon monoxide dehydrogenase, may be used for carbon
fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway
by some, or all, organisms. Given the large number of ABC-type
sugar and amino acid transporters encoded in the Ferroplasma type
Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs
identified in the Leptospirillum group II genome (63% with putative assigned function) and
1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell
cartoons are shown within a biofilm that is attached to the surface of an acid mine
drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation,
pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate
carboxylase–oxygenase. THF, tetrahydrofolate.
NATURE | doi:10.1038/nature02340 | www.nature.com/nature 5 © 2004 Nature Publishing Group
Culture Independent “Metagenomics”
DNA DNA DNA
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 AGGGGAGCTCTGCCTCG
New3 ACTCCAGCTATCGATCG
New4 ACTGCACCTATCGTTCG
RecA RecA RecA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
sequences are not conserved at the nucleotide level [29]. As a
result, the nr database does not actually contain many more
protein marker sequences that can be used as references than
those available from complete genome sequences.
Comparison of phylogeny-based and similarity-based phylotyping
Although our phylogeny-based phylotyping is fully automated,
it still requires many more steps than, and is slower
than, similarity-based phylotyping methods such as
MEGAN [30]. Is it worth the trouble? Similarity-based phylotyping
works by searching a query sequence against a reference
database such as NCBI nr and deriving taxonomic
information from the best matches or 'hits'. When species
that are closely related to the query sequence exist in the
reference database, similarity-based phylotyping can work well.
However, if the reference database is a biased sample or if it
contains no closely related species to the query, then the top
hits returned could be misleading [31]. Furthermore,
similarity-based methods require an arbitrary similarity cut-off
value to define the top hits. Because individual bacterial
genomes and proteins can evolve at very different rates, a
universal cut-off that works under all conditions does not exist.
As a result, the final results can be very subjective.
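The sensitivity of similarity-based phylotyping to the cut-off can be seen in a toy sketch. This is not MEGAN or any real search tool: `identity`, `best_hit_taxon`, the three-sequence reference set, and both cut-off values are hypothetical, and the domain labels attached to the toy sequences from the slides are an assumption made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this sketch")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit_taxon(query, references, cutoff):
    """Return the label of the most similar reference, or None when even
    the best hit falls below the similarity cut-off."""
    best_label, best_score = None, -1.0
    for label, seq in references.items():
        score = identity(query, seq)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= cutoff else None


# Toy reference set; labels are assumed for illustration only.
REFERENCES = {
    "Bacteria": "ACTGCACCTATCGTTCG",
    "Eukaryotes": "ACTCCAGCTATCGATCG",
    "Archaea": "ACCCCAGCTCTCGCTCG",
}
QUERY = "AGGGGAGCTCTCGCTCG"  # the "New4" sequence from the slides

print(best_hit_taxon(QUERY, REFERENCES, cutoff=0.75))
print(best_hit_taxon(QUERY, REFERENCES, cutoff=0.90))
```

The same query is assigned at one cut-off and left unassigned at the other, and in neither case does the answer say how the query relates to the references in a tree.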
In contrast, our tree-based bracketing algorithm places the
query sequence within the context of a phylogenetic tree and
only assigns it to a taxonomic level if that level has adequate
sampling (see Materials and methods [below] for details of
the algorithm). With the well sampled species Prochlorococ-
cus marinus, for example, our method can distinguish closely
related organisms and make taxonomic identifications at the
species level. Our reanalysis of the Sargasso Sea data placed
672 sequences (3.6% of the total) within a P. marinus clade.
On the other hand, for sparsely sampled clades such as
Aquifex, assignments will be made only at the phylum level.
Thus, our phylogeny-based analysis is less susceptible to data
sampling bias than a similarity based approach, and it makes
Figure 3. Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using
AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The
breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.
[Plot. y-axis: Relative abundance (0 to 0.8). x-axis: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria. Legend (marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf]
RpoB RpoB RpoB
Rpl4 Rpl4 Rpl4
rRNA rRNA rRNA
Hsp70 Hsp70 Hsp70
EFTu EFTu EFTu
Many other genes
better than rRNA
AMPHORA
AMPHORA
Phylotyping w/ Protein Markers
AMPHORA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
Martin Wu
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Phylogenetic ID of Novel Lineages
Dongying Wu
Wu D, Wu M, Halpern A, Rusch DB,
Yooseph S, Frazier M, et al. (2011)
Stalking the Fourth Domain in
Metagenomic Data: Searching for,
Discovering, and Interpreting Novel, Deep
Branches in Marker Gene Phylogenetic
Trees. PLoS ONE 6(3): e18011. doi:
10.1371/journal.pone.0018011
Phylogenetic Diversity of Metagenomes
typically used as a qualitative measure because duplicate sequences
are usually removed from the tree. However, the test may be used in a
semiquantitative manner if all clones, even those with identical or
near-identical sequences, are included in the tree (13).
Here we describe a quantitative version of UniFrac that we call
“weighted UniFrac.” We show that weighted UniFrac behaves similarly
to the FST test in situations where both a
FIG. 1. Calculation of the unweighted and the weighted UniFrac
measures. Squares and circles represent sequences from two different
environments. (a) In unweighted UniFrac, the distance between the
circle and square communities is calculated as the fraction of the
branch length that has descendants from either the square or the circle
environment (black) but not both (gray). (b) In weighted UniFrac,
branch lengths are weighted by the relative abundance of sequences in
the square and circle communities; square sequences are weighted
twice as much as circle sequences because there are twice as many total
circle sequences in the data set. The width of branches is proportional
to the degree to which each branch is weighted in the calculations, and
gray branches have no weight. Branches 1 and 2 have heavy weights
since the descendants are biased toward the square and circles,
respectively. Branch 3 contributes no value since it has an equal
contribution from circle and square sequences after normalization.
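The two measures in the caption above can be sketched on a toy tree. Each branch is given as (length, number of square sequences below it, number of circle sequences below it); the tree, its branch lengths, and the counts are invented for illustration. The weighted value is the raw, unnormalized sum of branch length times the difference in relative abundance.

```python
from typing import List, Tuple

Branch = Tuple[float, int, int]  # (length, squares below, circles below)


def unweighted_unifrac(branches: List[Branch]) -> float:
    """Fraction of observed branch length leading to only one community."""
    unique = sum(length for length, sq, ci in branches if (sq > 0) != (ci > 0))
    observed = sum(length for length, sq, ci in branches if sq > 0 or ci > 0)
    return unique / observed


def weighted_unifrac(branches: List[Branch], total_sq: int, total_ci: int) -> float:
    """Sum of branch length times the difference in relative abundance."""
    return sum(
        length * abs(sq / total_sq - ci / total_ci)
        for length, sq, ci in branches
    )


# Toy tree with 2 square and 4 circle sequences (hypothetical values).
toy_branches = [
    (1.0, 1, 0),  # square-only leaf
    (1.0, 0, 2),  # internal branch above a circle-only clade
    (0.5, 0, 1),
    (0.5, 0, 1),
    (1.0, 1, 2),  # mixed clade: 1/2 of squares and 2/4 of circles
    (0.5, 1, 0),
    (0.5, 0, 1),
    (0.5, 0, 1),
]

print(unweighted_unifrac(toy_branches))
print(weighted_unifrac(toy_branches, total_sq=2, total_ci=4))
```

The mixed-clade branch counts as shared in the unweighted measure and contributes zero to the weighted one, because after normalization it carries the same fraction of each community, like Branch 3 in the caption.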
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of
Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Jessica Green
Steven Kembel
Katie Pollard
Phylosift
Input Sequences
[Workflow figure: each input sequence is scanned against both workflows (rRNA workflow and protein workflow; parallel option).
Steps shown: LAST (fast candidate search; search input against references); hmmalign (multiple alignment; profile HMMs used to align candidates to reference alignment); Infernal (multiple alignment); branch labels <600 bp and >600 bp; pplacer (phylogenetic placement).
Outputs: Taxonomic Summaries (Krona plots, number of reads placed for each marker gene); Sample Analysis & Comparison (Edge PCA, tree visualization, Bayes factor tests).]
Aaron Darling
@koadman
Erik Matsen
@ematsen
Holly Bik
@hollybik
Guillaume Jospin
@guillaumejospin
Darling AE, Jospin G, Lowe E,
Matsen FA IV, Bik HM, Eisen JA.
(2014) PhyloSift: phylogenetic
analysis of genomes and
metagenomes. PeerJ 2:e243
http://dx.doi.org/10.7717/peerj.243
Erik Lowe
Edge PCA: Identify
lineages that explain most
variation among samples
Edge PCA - Matsen and Evans 2013
Output: Edge PCA
Using Phylogeny 2: Functional Prediction
PHYLOGENETIC PREDICTION OF GENE FUNCTION
METHOD
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN)
[Figure: flowchart of the method applied to EXAMPLE A and EXAMPLE B; tree leaves are labeled 1 to 6 and 1A, 2A, 3A, 1B, 2B, 3B, grouped under Species 1, Species 2, and Species 3; branch annotations: Duplication, Duplication?, Ambiguous]
Based on Eisen, 1998 Genome Res 8: 163-167.
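The last two steps of the method (overlay known functions onto the tree, then infer the likely function of the gene of interest) can be sketched as follows. This is a simplified stand-in, not the procedure from the cited paper: it reports the functions found in the smallest clade that contains the query and at least one annotated gene, and treats more than one function as ambiguous. The tree topology and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # a leaf name or a tuple of subtrees


def leaves(tree: Tree) -> List[str]:
    """List the leaf names under a node."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def predict_function(tree: Tree, query: str, known: Dict[str, str]) -> Optional[Set[str]]:
    """Functions in the smallest clade holding the query and an annotated gene.

    Assumes the query is a leaf of the tree. Returns None when no gene in
    the tree is annotated; a set with several entries means ambiguous.
    """
    best = None
    node = tree
    while not isinstance(node, str):
        functions = {known[g] for g in leaves(node) if g in known}
        if functions:
            best = functions
        # Step down into the child clade that contains the query.
        node = next(child for child in node if query in leaves(child))
    return best


# Hypothetical gene tree: a duplication produced the A and B subfamilies.
gene_tree = (("1A", ("2A", "3A")), ("1B", ("2B", "3B")))
known_functions = {
    "1A": "repair", "3A": "repair",
    "1B": "recombination", "3B": "recombination",
}

print(predict_function(gene_tree, "2A", known_functions))
print(predict_function(gene_tree, "2B", known_functions))
```

Gene 2A inherits the function of its own subfamily rather than that of whichever annotated gene happens to be most similar, which is the point of using the tree.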
Phylogenomics
Overlaying Functions onto Tree
[Figure: gene tree with leaves labeled by species (Aquae, Trepa, Rat, Fly, Xenla, Mouse, Human, Yeast, Neucr, Arath, Borbu, Synsp, Neigo, Thema, Strpy, Bacsu, Ecoli, Theaq, Deira, Chltr, Spombe, Celeg, Metth, Helpy, mSaco) and clades labeled MSH1, MSH2, MSH3, MSH4, MSH5, MSH6, MutS1, and MutS2]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Phylogenomics ~~ Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416
Proteorhodopsin Functional Diversity
Venter et al., Science 304: 66. 2004
Shotmap
[Workflow figure, steps numbered 1 to 9]
Simulate metagenomic library
Translate metagenomic reads
Search metagenomic peptides
Classify metagenomic peptides
Estimate protein family abundance
Taxonomic profiles from real metagenomes
Protein family database
IMG/ER reference genomes
Construct mock community
Annotate genes in genomes
Expected abundance of gene families
Protein family database
Evaluate estimation accuracy
Tom Sharpton
Katie Pollard
https://github.com/sharpton/shotmap
Functional Prediction from Metagenomes
DNA
Taxa Characters
B1 ACTGCACCTATCGTTCG
B2 ACTCCACCTATCGTTCG
E1 ACTCCAGCTATCGATCG
E2 ACTCCAGGTATCGATCG
A1 ACCCCAGCTCTCGCTCG
A2 ACCCCAGCTCTGGCTCG
New1 ACCCCAGCTCTGCCTCG
New2 AGGGGAGCTCTGCCTCG
New3 ACTCCAGCTATCGATCG
New4 ACTGCACCTATCGTTCG
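As a rough illustration of how the New1 to New4 sequences in the table relate to the references, the sketch below assigns each one to its closest reference by counting mismatched positions. This is a similarity-based shortcut, not a phylogenetic placement, and it assumes all sequences are already aligned and of equal length.

```python
from typing import Dict, Tuple

references: Dict[str, str] = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}

queries: Dict[str, str] = {
    "New1": "ACCCCAGCTCTGCCTCG",
    "New2": "AGGGGAGCTCTGCCTCG",
    "New3": "ACTCCAGCTATCGATCG",
    "New4": "ACTGCACCTATCGTTCG",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_reference(query: str, refs: Dict[str, str]) -> Tuple[str, int]:
    """Return (reference name, mismatch count) for the closest reference."""
    name = min(refs, key=lambda ref: hamming(query, refs[ref]))
    return name, hamming(query, refs[name])


for label, seq in queries.items():
    print(label, nearest_reference(seq, references))
```

New3 and New4 match E1 and B1 exactly, New1 sits one mismatch from A2, and New2 is five mismatches from its closest reference, the kind of sequence a best-hit label would describe poorly.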
inputs of fixed carbon or nitrogen from external sources. As with
Leptospirillum group I, both Leptospirillum group II and III have the
genes needed to fix carbon by means of the Calvin–Benson–
Bassham cycle (using type II ribulose 1,5-bisphosphate carboxy-
lase–oxygenase). All genomes recovered from the AMD system
contain formate hydrogenlyase complexes. These, in combination
with carbon monoxide dehydrogenase, may be used for carbon
fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway
by some, or all, organisms. Given the large number of ABC-type
sugar and amino acid transporters encoded in the Ferroplasma type
Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs
identified in the Leptospirillum group II genome (63% with putative assigned function) and
1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell
cartoons are shown within a biofilm that is attached to the surface of an acid mine
drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation,
pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate
carboxylase–oxygenase. THF, tetrahydrofolate.
NATURE | doi:10.1038/nature02340 | www.nature.com/nature ©2004 Nature Publishing Group
Phylogenetic Prediction of Function
• Many powerful and automated similarity based
methods for assigning genes to protein families
• COGs
• PFAM HMM searches
• Some limitations of similarity based methods can be
overcome by phylogenetic approaches
• Automated methods now available
• Sean Eddy
• Steven Brenner
• Kimmen Sjölander
• But …
Carboxydothermus hydrogenoformans
• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon
Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005
PLoS Genetics 1: e65.)
Homologs of Sporulation Genes
Wu et al. 2005 PLoS
Genetics 1: e65.
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Ask: Yes or No, is each gene
found in each other species
• Cluster genes by distribution
patterns (profiles)
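The profiling steps above can be sketched in a few lines. The search step is assumed to have been done already, so the input is a yes/no table of which genomes contain a homolog of each gene; the gene and genome names are placeholders. For simplicity, genes are grouped only when their profiles are identical, whereas a real analysis would usually cluster similar profiles as well.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Profile = Tuple[bool, ...]


def cluster_by_profile(
    presence: Dict[str, Set[str]], genomes: List[str]
) -> Dict[Profile, List[str]]:
    """Group genes that share the same presence/absence pattern."""
    clusters: Dict[Profile, List[str]] = defaultdict(list)
    for gene, found_in in presence.items():
        # One yes/no answer per genome, in a fixed genome order.
        profile = tuple(genome in found_in for genome in genomes)
        clusters[profile].append(gene)
    return dict(clusters)


# Placeholder genomes and genes (hypothetical search results).
genomes = ["spore_former_1", "spore_former_2",
           "non_spore_former_1", "non_spore_former_2"]
presence = {
    "gene_a": {"spore_former_1", "spore_former_2"},
    "gene_b": {"spore_former_1", "spore_former_2"},
    "gene_c": set(genomes),
    "gene_d": {"non_spore_former_1"},
}

profile_clusters = cluster_by_profile(presence, genomes)
for profile, genes in profile_clusters.items():
    print(profile, genes)
```

Genes that appear only in the spore-forming genomes end up in one group, which is how a profile can suggest a shared role without any sequence similarity between the genes themselves.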
Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.
B. subtilis new sporulation genes
J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12
Bjorn Traag
Richard Losick
Phylogenetic Profiling for Metagenomics?
Using Phylogeny 3: Linking Function and Phylogeny
HiC Crosslinking & Sequencing
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore
RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-
level deconvolution of a synthetic metagenome by
sequencing proximity ligation products. PeerJ 2:e415
http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the
synthetic microbial community are shown before and after filtering, along with the percent of total
constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon,
species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome
2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus,
K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence  Alignment   % of Total  Filtered    % of aligned  Length     GC     #R.S.
Lac0      10,603,204  26.17%      10,269,562  96.85%        2,291,220  0.462  629
Lac1      145,718     0.36%       145,478     99.84%        13,413     0.386  3
Lac2      691,723     1.71%       665,825     96.26%        35,595     0.385  16
Lac       11,440,645  28.23%      11,080,865  96.86%        2,340,228  0.46   648
Ped       2,084,595   5.14%       2,022,870   97.04%        1,832,387  0.373  863
BL21      12,882,177  31.79%      2,676,458   20.78%        4,558,953  0.508  508
K12       9,693,726   23.92%      1,218,281   12.57%        4,686,137  0.507  568
E. coli   22,575,903  55.71%      3,894,739   17.25%        9,245,090  0.51   1076
Bur1      1,886,054   4.65%       1,797,745   95.32%        2,914,771  0.68   144
Bur2      2,536,569   6.26%       2,464,534   97.16%        3,809,201  0.672  225
Bur       4,422,623   10.91%      4,262,279   96.37%        6,723,972  0.68   369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is
shown for read pairs mapping to each chromosome. For each read pair the minimum path length on
the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded.
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin
was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and
plotted.
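The calculation described in the Figure 1 caption can be sketched as below: shortest distance around a circular chromosome, a 1000 bp minimum, 100 equal bins over 2.5 Mb, and normalization to sum to 1. How the original analysis handled pairs at or beyond 2.5 Mb is not stated, so dropping them here is an assumption, and the example read pairs are invented.

```python
from typing import List, Sequence, Tuple


def circular_distance(pos1: int, pos2: int, chrom_len: int) -> int:
    """Shortest path between two positions on a circular chromosome."""
    direct = abs(pos1 - pos2)
    return min(direct, chrom_len - direct)


def insert_distribution(
    pairs: Sequence[Tuple[int, int]],
    chrom_len: int,
    max_dist: int = 2_500_000,
    n_bins: int = 100,
    min_dist: int = 1000,
) -> List[float]:
    """Binned, normalized distribution of read-pair distances."""
    bin_width = max_dist / n_bins
    counts = [0] * n_bins
    for pos1, pos2 in pairs:
        dist = circular_distance(pos1, pos2, chrom_len)
        # Discard close pairs; pairs beyond the plotted range are dropped
        # (an assumption, the caption does not say).
        if dist < min_dist or dist >= max_dist:
            continue
        counts[int(dist // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


K12_LENGTH = 4_686_137  # E. coli K12 length from Table 1
example_pairs = [
    (0, 30_000),
    (100, 4_686_000),        # 237 bp apart across the origin: discarded
    (0, 26_000),
    (1_000_000, 1_060_000),
]
distribution = insert_distribution(example_pairs, K12_LENGTH)
print(distribution[:3])
```

Using the circular distance avoids the artifact of treating two positions on either side of the linearization point as almost a full genome apart.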
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1;
(Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning
the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137)
due to edge effects induced by BWA treating the sequence as a linear chromosome rather
than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs
associating each genomic replicon in the synthetic community is shown as a heat map (see color scale,
blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome
1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2:
L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same
alignment parameters as were used in the top ranked clustering (described above). We first
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges
depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof
depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend)
with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded.
Contig associations were normalized for variation in contig size.
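A graph like the one in the Figure 3 caption can be sketched from a list of Hi-C read pairs, each reduced to the two contigs it maps to. The 5 kb and weight-of-5 cutoffs come from the caption; applying the weight cutoff to raw counts and normalizing by the product of the two contig lengths are both assumptions, since the excerpt does not give the exact formula. Contig names and lengths are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def contig_graph(
    read_pairs: Iterable[Tuple[str, str]],
    contig_len: Dict[str, int],
    min_len: int = 5000,
    min_weight: int = 5,
) -> Dict[Edge, float]:
    """Edges between contigs linked by Hi-C pairs, with normalized weights."""
    counts: Counter = Counter()
    for c1, c2 in read_pairs:
        if c1 == c2:
            continue  # pairs within one contig do not link contigs
        if contig_len[c1] < min_len or contig_len[c2] < min_len:
            continue  # drop short contigs
        counts[tuple(sorted((c1, c2)))] += 1

    edges: Dict[Edge, float] = {}
    for (c1, c2), n in counts.items():
        if n >= min_weight:
            # Assumed normalization: divide by the product of contig lengths.
            edges[(c1, c2)] = n / (contig_len[c1] * contig_len[c2])
    return edges


lengths = {"c1": 10_000, "c2": 20_000, "c3": 8_000, "small": 3_000}
pairs = (
    [("c1", "c2")] * 6 + [("c2", "c1")]
    + [("c1", "c3")] * 4
    + [("c1", "small")] * 9
    + [("c1", "c1")] * 3
)
graph_edges = contig_graph(pairs, lengths)
print(graph_edges)
```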
typically represent the reads and variant sites as a variant graph wherein variant sites are
represented as nodes, and sequence reads define edges between variant sites observed in
the same read (or read pair). We reasoned that variant graphs constructed from Hi-C
data would have much greater connectivity (where connectivity is defined as the mean
path length between randomly sampled variant positions) than graphs constructed from
mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such
Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of
Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A),
Chris Beitel
@datscimed
Aaron Darling
@koadman
Pink Berries
PB-PSB1
(Purple sulfur bacteria)
PB-SRB1
(Sulfate reducing bacteria)
(sulfate)
(sulfide)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Lizzy Wilbanks
@lizzywilbanks
Long Reads Help, A Lot
Hiseq & Miseq
100-250 bp
Moleculo
2-20 kb
Pacbio RSII
2-20kb
Micky Kertesz,
Tim Blauwcamp
Meredith Ashby
Cheryl Heiner
Illumina-based
“synthetic long
reads”
Real-time single
molecule
sequencing
(p4-c2, p5-c3)
295 Megabases 474 Megabases 61 Gigabases
Using Phylogeny 4: Better Reference Data
PhyEco Markers
Phylogenetic group     Genome Number  Gene Number  Marker Candidates
Archaea                62             145415       106
Actinobacteria         63             267783       136
Alphaproteobacteria    94             347287       121
Betaproteobacteria     56             266362       311
Gammaproteobacteria    126            483632       118
Deltaproteobacteria    25             102115       206
Epsilonproteobacteria  18             33416        455
Bacteroidetes          25             71531        286
Chlamydiae             13             13823        560
Chloroflexi            10             33577        323
Cyanobacteria          36             124080       590
Firmicutes             106            312309       87
Spirochaetes           18             38832        176
Thermi                 5              14160        974
Thermotogae            9              17037        684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families
for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological
Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE
8(10): e77033. doi:10.1371/journal.pone.0077033
Better Protein Families
Representative Genomes
Extract Protein Annotation
All v. All BLAST
Homology Clustering (MCL)
SFams
Align & Build HMMs
HMMs
Screen for Homologs
New Genomes
Extract Protein Annotation
Figure 1 (panels A, B, C)
Sharpton et al. 2012. BMC Bioinformatics, 13(1), 264.
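The homology clustering step can be illustrated with a deliberately simple substitute. The workflow above uses MCL; the sketch below uses single-linkage clustering with a union-find structure instead, which is easier to show in a few lines but merges families more aggressively than MCL. The score cutoff and the protein names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_homologs(
    hits: Iterable[Tuple[str, str, float]], min_score: float = 50.0
) -> List[List[str]]:
    """Single-linkage families from all-vs-all hits (a stand-in for MCL)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in hits:
        root_a, root_b = find(a), find(b)  # registers both proteins
        if score >= min_score and root_a != root_b:
            parent[root_a] = root_b

    families: Dict[str, List[str]] = {}
    for protein in parent:
        families.setdefault(find(protein), []).append(protein)
    return sorted(sorted(members) for members in families.values())


# Hypothetical all-vs-all hits: (protein, protein, bit score).
example_hits = [
    ("p1", "p2", 200.0),
    ("p2", "p3", 120.0),
    ("p3", "p4", 20.0),   # below the cutoff, so the families stay apart
    ("p4", "p5", 90.0),
]
protein_families = cluster_homologs(example_hits)
print(protein_families)
```

Each resulting family would then be aligned and turned into a profile HMM, which is what allows new genomes to be screened against the families later.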
Diverse Reference Genomes
Microbial Dark Matter Part 2
• Ramunas
Stepanauskas
• Tanja Woyke
• Jonathan Eisen
• Duane Moser
• Tullis Onstott
Talk by J. Eisen for NZ Computational Genomics meeting
Phylogeny Isn’t Everything .. Model Systems
Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Simple Symbioses
Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Phylogenetic Binning
Nancy Moran
Dongying Wu
Drosophila microbiome w/ Kopp Lab
Both natural surveys and laboratory
experiments indicate that host diet
plays a major role in shaping the
Drosophila bacterial microbiome.
Laboratory strains provide only a
limited model of natural host–microbe
interactions
Jenna Lang Angus Chandler
Rice Microbiome w/ Sundar Lab
Edwards et al. 2015. Structure, variation,
and assembly of the root-associated
microbiomes of rice. PNAS
Supplementary Figures
Fig. S1. Map depicting soil collection locations for greenhouse experiment.
Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice
plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil
three separate rhizocompartments: the rhizosphere, rhizoplane,
and endosphere (Fig. 1A). Because the root microbiome has
been shown to correlate with the developmental stage of the
plant (10), the root-associated microbial communities were
sampled at 42 d (6 wk), when rice plants from all genotypes were
well-established in the soil but still in their vegetative phase of
growth. For our study, the rhizosphere compartment was com-
Fig. 1. Root-associated microbial communities are separable by rhizo-
compartment and soil type. (A) A representation of a rice root cross-section
depicting the locations of the microbial communities sampled. (B) Within-
sample diversity (α-diversity) measurements between rhizospheric compart-
ments indicate a decreasing gradient in microbial diversity from the rhizo-
sphere to the endosphere independent of soil type. Estimated species
richness was calculated as e^(Shannon_entropy). The horizontal bars within boxes
represent median. The tops and bottoms of boxes represent 75th and 25th
quartiles, respectively. The upper and lower whiskers extend 1.5× the
interquartile range from the upper edge and lower edge of the box, re-
spectively. All outliers are plotted as individual points. (C) PCoA using the
WUF metric indicates that the largest separation between microbial com-
munities is spatial proximity to the root (PCo 1) and the second largest
source of variation is soil type (PCo 2). (D) Histograms of phyla abundances in
each compartment and soil. B, bulk soil; E, endosphere; P, rhizoplane; S,
rhizosphere; Sac, Sacramento.
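The richness estimate named in the caption, e raised to the Shannon entropy, can be computed from a vector of OTU counts as in the sketch below. It assumes the entropy uses the natural logarithm, which is what makes e the matching base; the example counts are invented.

```python
import math
from typing import Sequence


def shannon_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a vector of OTU counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def effective_richness(counts: Sequence[float]) -> float:
    """Estimated species richness, e ** Shannon entropy."""
    return math.exp(shannon_entropy(counts))


even_community = [10, 10, 10, 10]
skewed_community = [97, 1, 1, 1]

print(effective_richness(even_community))
print(effective_richness(skewed_community))
```

Four equally abundant OTUs give a value of 4, while four OTUs dominated by one give a value much closer to 1, so the measure reflects evenness as well as the raw OTU count.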
2 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1414592112
gate the relationship between rice ge-
icrobiome, domesticated rice varieties
rated growing regions were tested. Six
spanning two species within the Oryza
2 d in the greenhouse before sampling.
a) cultivars M104, Nipponbare (both
ties), IR50, and 93-11 (both indica va-
gside two cultivars of African cultivated
g7102 (Glab B) and TOg7267 (Glab E).
ed that rice genotype accounted for
ariation between microbial communities
% of the variance, P < 0.001; Dataset
f the variance, P < 0.066; Dataset S5H);
ntations for clustering patterns of the
nt on the first two axes of unconstrained
ppendix, Fig. S10). We then used CAP
ffect of rice genotype on the microbial
g on rice cultivar and controlling for
and technical factors, we found that ge-
ice have a significant effect on root-
mmunities (5.1%, P = 0.005, WUF, Fig.
, UUF, SI Appendix, Fig. S11A). Ordi-
AP analysis revealed clustering patterns
only partially consistent with genetic
F and UUF metrics. The two japonica
her and the two O. glaberrima cultivars
ver, the indica cultivars were split, with
O. glaberrima cultivars and IR50 clus-
cultivars.
enotypic effect manifests in individual
eparated the whole dataset to focus on
vidually and conducted CAP analysis
and technical factors. The rhizosphere
eight sites were operated under two cultivation practices: organic
cultivation and a more conventional cultivation practice termed
“ecofarming” (see below). Because genotype explained the least
variance in the greenhouse data, we limited the analysis to one
cultivar, S102, a California temperate japonica variety that is
widely cultivated by commercial growers and is closely related to
M104 (26). Field samples were collected from vegetatively
growing rice plants in flooded fields and the previously defined
rhizocompartments were analyzed as before. Unfortunately,
collection of bulk soil controls for the field experiment was not
Fig. 3. Host plant genotype significantly affects microbial communities in
the rhizospheric compartments. (A) Ordination of CAP analysis using the
WUF metric constrained to rice genotype. (B) Within-sample diversity
measurements of rhizosphere samples of each cultivar grown in each soil.
Estimated species richness was calculated as e^(Shannon_entropy). The horizontal
bars within boxes represent median. The tops and bottoms of boxes repre-
sent 75th and 25th quartiles, respectively. The upper and lower whiskers
extend 1.5× the interquartile range from the upper edge and lower edge of
the box, respectively. All outliers are plotted as individual points.
fields are too high to find representative soil that is unlikely to
be affected by nearby plants. Amplification and sequencing of
the field microbiome samples yielded 13,349,538 high-quality
sequences (median: 54,069 reads per sample; range: 12,535–
148,233 reads per sample; Dataset S13). The sequences were
clustered into OTUs using the same criteria as the greenhouse
experiment, yielding 222,691 microbial OTUs and 47,983 OTUs
with counts >5 across the field dataset.
We found that the microbial diversity of field rice plants is
significantly influenced by the field site. α-Diversity measure-
ments of the field rhizospheres indicated that the cultivation site
significantly impacts microbial diversity (SI Appendix, Fig. S14A,
P = 2.00E-16, ANOVA and Dataset S14). Unconstrained PCoA
using both the WUF and UUF metrics showed that microbial
communities separated by field site across the first axis (Fig. 4B,
WUF and SI Appendix, Fig. S14B, UUF). PERMANOVA agreed
with the unconstrained PCoA in that field site explained the
largest proportion of variance between the microbial communi-
ties for field plants (30.4% of variance, P < 0.001, WUF, Dataset
S5O and 26.6% of variance, P < 0.001, UUF, Dataset S5P). CAP
analysis constrained to field site and controlled for rhizocom-
partment, cultivation practice, and technical factors (sequencing
batch and biological replicate) agreed with the PERMANOVA
results in that the field site explains the largest proportion of
variance between the root-associated microbial communities in
field plants (27.3%, P = 0.005, WUF, SI Appendix, Fig. S15A
and 28.9%, P = 0.005, UUF, SI Appendix, Fig. S15E), sug-
gesting that geographical factors may shape root-associated
microbial communities.
Rhizospheric Compartmentalization Is Retained in Field Plants. Sim-
ilar to the greenhouse plants, the rhizospheric microbiomes of
field plants are distinguishable by compartment. α-Diversity of
the field plants again showed that the rhizosphere had the
highest microbial diversity, whereas the endosphere had the least
S15). PCoA
the WUF a
compartmen
Appendix, F
separation i
ond largest
(20.76%, P
UUF, Data
biomes cons
trolled for f
agreed with
variance bet
compartmen
and 10.9%,
Taxonomi
overall sim
Chloroflexi,
microbiota.
endosphere
Proteobacteri
and Plancto
distribution
trend from t
Appendix, Fi
We again
OTUs in the
S16). We fo
endosphere c
representing
Fig. S17). Th
the genus A
and Alphap
terestingly, 1
found to b
greenhouse
OTUs were
sisted of tax
and Myxoco
bidopsis roo
Cultivation Pr
The rice fiel
practices, org
tion called
farming in th
are all perm
harvest fumi
itself does si
partments ov
a significant
the rhizocom
indicating th
affected diffe
the rhizosph
practice, with
zospheres th
Dataset S14)
crobial comm
tests; Datase
practices are
the WUF m
S14D). PER
Fig. 4. Root-associated microbiomes from field-grown plants are separable
by cultivation site, rhizospheric compartment, and cultivation practice. (A)
Variation w/in Plant
Cultivation Site Effects
Rice Genotype Effects
and mitochondrial) reads to analyze microbial abundance in
the endosphere over time (Fig. 6A). Using this technique, we
confirmed the sterility of seedling roots before transplantation.
We found that microbial penetrance into the endosphere oc-
curred at or before 24 h after transplantation and that the pro-
portion of microbial reads to organellar reads increased over the
first 2 wk after transplantation (Fig. 6A). To further support the
evidence for microbiome acquisition within the first 24 h, we
sampled root endospheric microbiomes from sterilely germi-
nated seedlings before transplanting into Davis field soil as well
as immediately after transplantation and 24 h after transplan-
tation (SI Appendix, Fig. S24). The root endospheres of sterilely
germinated seedlings, as well as seedlings transplanted into
Davis field soil for 1 min, both had a very low percentage of
microbial reads compared with organellar reads (0.22% and
0.71%), with the differences not statistically significant (P = 0.1,
Wilcoxon test). As before, endospheric microbial abundance
increased significantly, by >10-fold after 24 h in field soil (3.95%,
P = 0.05, Wilcoxon test). We conclude that brief soil contact
does not strongly increase the proportion of microbial reads, and
therefore the increase in microbial reads at 24 h is indicative of
endophyte acquisition within 1 d after transplantation.
α-Diversity significantly varied by rhizocompartment (P < 2E-
16; Dataset S23) and there was a significant interaction between
rhizocompartment and collection time (P = 0.042; Dataset S23);
however, when each rhizocompartment was analyzed individ-
ually, the bulk soil was the only compartment that showed
(13 d) approach the endosphere and rhizoplane microbiome
compositions for plants that have been grown in the green-
house for 42 d.
There are slight shifts in the distribution of phyla over time;
however, there are significant distinctions between the com-
partments starting as early as 24 h after transplantation into soil
(Fig. 6D, SI Appendix, Figs. S24B and S26, and Dataset S24).
Because each phylum consists of diverse OTUs that could ex-
hibit very different behaviors during acquisition, we next ex-
amined the dynamics and colonization patterns of specific
OTUs within the time-course experiment. The core set of 92
endosphere-enriched OTUs obtained from the previous green-
house experiment (SI Appendix, Fig. S9C) was analyzed for
relative abundances at different time points (Fig. 6E). Of the 92
core endosphere-enriched microbes present in the greenhouse
experiment, 53 OTUs were detectable in the endosphere in the
time-course experiment. The average abundance profile over
time revealed a colonization pattern for the core endospheric
microbiome. Relative abundance of the core endosphere-
enriched microbiome peaks early (3 d) in the rhizosphere and
then decreases back to a steady, low level for the remainder of
the time points. Similarly, the rhizoplane profile shows an in-
crease after 3 d with a peak at 8 d with a decline at 13 d. The
endosphere generally follows the rhizoplane profile, except that
relative abundance is still increasing at 13 d. These results sug-
gest that the core endospheric microbes are first attracted to the
rhizosphere and then locate to the rhizoplane, where they attach
Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11
modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of
greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other
methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An
edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119
across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes
represent no particular scale.
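The network rule in the Fig. 5 caption, an edge between two OTUs when their Pearson correlation is at least 0.6, can be sketched as follows. The OTU names and abundance profiles are invented, and module detection, which the paper performs on top of the network, is not shown.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sd_x * sd_y)


def coabundance_edges(
    abundance: Dict[str, Sequence[float]], threshold: float = 0.6
) -> List[Tuple[str, str]]:
    """Pairs of OTUs whose profiles correlate at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(abundance), 2)
        if pearson(abundance[a], abundance[b]) >= threshold
    ]


# Invented abundance profiles across four samples.
otu_abundance = {
    "otu_1": [1.0, 2.0, 3.0, 4.0],
    "otu_2": [2.0, 4.0, 6.0, 8.0],
    "otu_3": [4.0, 3.0, 2.0, 1.0],
}
network_edges = coabundance_edges(otu_abundance)
print(network_edges)
```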
Function x Genotype
of magnitude greater than in any single plant species to date.
Under controlled greenhouse conditions, the rhizocompartments
described the largest source of variation in the microbial com-
munities sampled (Dataset S5A). The pattern of separation be-
tween the microbial communities in each compartment is
consistent with a spatial gradient from the bulk soil across the
rhizosphere and rhizoplane into the endosphere (Fig. 1C).
Similarly, microbial diversity patterns within samples hold the
same pattern where there is a gradient in α-diversity from the
rhizosphere to the endosphere (Fig. 1B). Enrichment and de-
pletion of certain microbes across the rhizocompartments indi-
cates that microbial colonization of rice roots is not a passive
process and that plants have the ability to select for certain mi-
crobial consortia or that some microbes are better at filling the
root colonizing niche. Similar to studies in Arabidopsis, we found
that the relative abundance of Proteobacteria is increased in the
endosphere compared with soil, and that the relative abundances
of Acidobacteria and Gemmatimonadetes decrease from the soil
to the endosphere (9–11), suggesting that the distribution of
different bacterial phyla inside the roots might be similar for all
land plants (Fig. 1D and Dataset S6). Under controlled green-
house conditions, soil type described the second largest source
of variation within the microbial communities of each sample.
However, the soil source did not affect the pattern of separation
between the rhizospheric compartments, suggesting that the
rhizocompartments exert a recruitment effect on microbial con-
sortia independent of the microbiome source.
By using differential OTU abundance analysis in the com-
partments, we observed that the rhizosphere serves an enrich-
ment role for a subset of microbial OTUs relative to bulk soil
(Fig. 2). Further, the majority of the OTUs enriched in the
rhizosphere are simultaneously enriched in the rhizoplane and/or
endosphere of rice roots (Fig. 2B and SI Appendix, Fig. S16B),
consistent with a recruitment model in which factors produced by
the root attract taxa that can colonize the endosphere. We found
that the rhizoplane, although enriched for OTUs that are also
Time Series
Acknowledgements
DOE JGI Sloan GBMF NSF
DHS DARPA
Aaron Darling

Lizzy
Wilbanks
Jenna Lang Russell
Neches
Rob Knight
Jack Gilbert Tanja Woyke Rob Dunn
Katie Pollard
Jessica
Green
Darlene
Cavalier
Eddy Rubin
Wendy Brown
Dongying Wu
Phil
Hugenholtz
DSMZ
Sundar
Srijak
Bhatnagar David Coil
Alex Alexiev
Hannah
Holland-Moritz
Holly Bik
John Zhang
Holly
Menninger
Guillaume
Jospin
David Lang
Cassie
Ettinger
Tim Harkins
Jennifer Gardy
Holly Ganz

Talk by J. Eisen for NZ Computational Genomics meeting

  • 1. Phylogeny driven approaches to the study of microbial diversity September 3, 2015 Queenstown Computational Genomics Conference Jonathan A. Eisen @phylogenomics University of California, Davis
  • 2. The Rise of the Microbiome [Figure: Pubmed “Microbiome” hits per year; x-axis years 00 to 13, y-axis 0 to 4000]
  • 3. microBIOME or microbiOME • microbi-OME • collection of genomes of microbes from a community (emphasis on OME) • micro-BIOME • a community of microbes (emphasis on BIOME) • see http://tinyurl.com/definemicrobiome
  • 4. Not Just About Humans or Hosts
  • 6. Why Now I: Appreciation of Microbial Diversity Functional Diversity Diversity of Form Phylogenetic Diversity
  • 7. Why Now I: Appreciation of Microbial Diversity Functional Diversity Diversity of Form Phylogenetic Diversity MICROBES RUN THE PLANET
  • 8. Why Now II: Post Genome Blues. The Microbiome; Transcriptome; Variome; Epigenome. Overselling the Human Genome?
  • 10. Why Now IV: Sequencing Has Gone Crazy
  • 11. Sequencing Revolution • More genes and genomes • Deeper sequencing • The rare biosphere • Relative abundance estimates • More samples (with barcoding) • Time series • Spatially diverse sampling • Fine scale sampling
  • 12. Turnbaugh et al Nature. 2006 444(7122):1027-31. Why Now V: Microbiome Functions
  • 13. Uses of Phylogeny 1: Species Phylogeny
  • 14. Woese: Classification of Cultured Taxa by rRNA. Isolate ribosomes; rRNA fragments: ACUGC ACCUAU CGUUCG; ACUCC AGCUAU CGAUCG; ACCCC AGCUCU CGCUCG. Groups: Eukaryotes; Bacteria; ????? / Archaebacteria / Archaea.
      Taxa  Characters
      S     ACUGCACCUAUCGUUCG
      R     ACUCCACCUAUCGUUCG
      E     ACUCCAGCUAUCGAUCG
      F     ACUCCAGGUAUCGAUCG
      C     ACCCCAGCUCUCGCUCG
      W     ACCCCAGCUCUGGCUCG
      Subset:
      S     ACUGCACCUAUCGUUCG
      E     ACUCCAGCUAUCGAUCG
      C     ACCCCAGCUCUCGCUCG
  • 15. Woese: Classification of Cultured Taxa by rRNA PCR. Isolate DNA; rRNA PCR; fragments: ACUGC ACCUAU CGUUCG; ACUCC AGCUAU CGAUCG; ACCCC AGCUCU CGCUCG. Groups: Archaea; Eukaryotes; Bacteria.
      Taxa  Characters
      S     ACUGCACCUAUCGUUCG
      R     ACUCCACCUAUCGUUCG
      E     ACUCCAGCUAUCGAUCG
      F     ACUCCAGGUAUCGAUCG
      C     ACCCCAGCUCUCGCUCG
      W     ACCCCAGCUCUGGCUCG
      Subset:
      S     ACUGCACCUAUCGUUCG
      E     ACUCCAGCUAUCGAUCG
      C     ACCCCAGCUCUCGCUCG
  • 16. Phylotyping via rRNA PCR: One Taxon. Isolate DNA; rRNA PCR; amplified fragment: ACTGC ACCTAT CGTTCG. Groups: Archaea; Eukaryotes; Bacteria.
      Taxa  Characters
      B1    ACTGCACCTATCGTTCG
      B2    ACTCCACCTATCGTTCG
      E1    ACTCCAGCTATCGATCG
      E2    ACTCCAGGTATCGATCG
      A1    ACCCCAGCTCTCGCTCG
      A2    ACCCCAGCTCTGGCTCG
      New1  ACTGCACCTATCGTTCG
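The comparison on the "One Taxon" phylotyping slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the toy sequences from the slide, and it substitutes a nearest-reference search by Hamming distance for the tree-based placement the talk describes. The function names `hamming` and `phylotype` are hypothetical.

```python
# Toy reference alignment copied from the slide (all sequences are 17 nt long).
REFERENCES = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}

# The first letter of each reference name encodes its group on the slide.
GROUPS = {"B": "Bacteria", "E": "Eukaryotes", "A": "Archaea"}


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def phylotype(query: str) -> tuple[str, str, int]:
    """Return (closest reference, its group, mismatch count) for a query.

    Assumption: nearest neighbour by Hamming distance stands in for
    placing the query on a phylogenetic tree.
    """
    best = min(REFERENCES, key=lambda name: hamming(query, REFERENCES[name]))
    return best, GROUPS[best[0]], hamming(query, REFERENCES[best])


if __name__ == "__main__":
    # New1 from the "One Taxon" slide.
    print(phylotype("ACTGCACCTATCGTTCG"))
```

A distance-based lookup like this ignores branch lengths and rate variation, which is one reason the talk emphasizes alignments and trees over raw similarity.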
  • 17. Chemosymbiont rRNA Phylotyping. Eisen et al. 1992. J. Bact. 174: 3416. Colleen Cavanaugh
  • 18. Taxa Characters B1 ACTGCACCTATCGTTCG B2 ACTCCACCTATCGTTCG E1 ACTCCAGCTATCGATCG E2 ACTCCAGGTATCGATCG A1 ACCCCAGCTCTCGCTCG A2 ACCCCAGCTCTGGCTCG New1 ACCCCAGCTCTGCCTCG New2 ACTGCACCTATCGTTCG Archaea EukaryotesBacteria ACTGC ACCTAT CGTTCG ACTGC ACCTAT CGTTCG ACCCC AGCTCT CGCTCG !18 rRNA rRNA PCR rRNA PCR Isolate DNA Phylotyping via rRNA PCR: Two Taxa
  • 19. Phylotyping via rRNA PCR: Four Taxa. Isolate DNA; rRNA PCR; amplified fragments: ACTGC ACCTAT CGTTCG; ACTCC AGCTAT CGATCG; ACCCC AGCTCT CGCTCG; AGGGG AGCTCT CGCTCG; AGGGG AGCTCT CGCTCG; ACTGC ACCTAT CGTTCG. Groups: Archaea; Eukaryotes; Bacteria.
      Taxa  Characters
      B1    ACTGCACCTATCGTTCG
      B2    ACTCCACCTATCGTTCG
      E1    ACTCCAGCTATCGATCG
      E2    ACTCCAGGTATCGATCG
      A1    ACCCCAGCTCTCGCTCG
      A2    ACCCCAGCTCTGGCTCG
      New1  ACCCCAGCTCTGCCTCG
      New2  ACTGCACCTATCGTTCG
      New3  ACCCCAGCTCTCGCTCG
      New4  AGGGGAGCTCTCGCTCG
  • 21. Approaching to NGS (from Slideshare presentation of Cosentino Cristian, http://www.slideshare.net/cosentia/high-throughput-equencing). Automation is Critical.
      1953  Discovery of DNA structure (Cold Spring Harb. Symp. Quant. Biol. 1953;18:123-31)
      1977  Sanger sequencing method by F. Sanger (PNAS, 1977, 74: 560-564)
      1983  PCR by K. Mullis (Cold Spring Harb Symp Quant Biol. 1986;51 Pt 1:263-73)
      1993  Development of pyrosequencing (Anal. Biochem., 1993, 208: 171-175; Science, 1998, 281: 363-365)
      1998  Single molecule emulsion PCR
      1998  Founded Solexa
      2000  Founded 454 Life Science
            Human Genome Project (Nature, 2001, 409: 860–92; Science, 2001, 291: 1304–1351)
      2005  454 GS20 sequencer (first NGS sequencer)
      2006  Solexa Genome Analyzer (first short-read NGS sequencer)
      2006  Illumina acquires Solexa (Illumina enters the NGS business)
      2007  ABI SOLiD (short-read sequencer based upon ligation)
      2007  Roche acquires 454 Life Sciences (Roche enters the NGS business)
      2008  GS FLX sequencer (NGS with 400-500 bp read length)
      2008  NGS Human Genome sequencing (first Human Genome sequencing based upon NGS technology)
      2010  Hi-Seq2000 (200 Gbp per flow cell)
      Other platforms shown: Miseq, Roche Jr, Ion Torrent, PacBio, Oxford
      Example reads: AAATCGCTAGCGC CGGCGAGCTAGC CGAGCGATCGAGC CGAGCATCGAGTA
  • 22. STAP (for rRNA) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP) Dongying Wu1 *, Amber Hartman1,6 , Naomi Ward4,5 , Jonathan A. Eisen1,2,3 1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences, University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America, 5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United States of America Abstract Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of data has opened many new windows into microbial diversity and evolution, and at the same time has created significant methodological challenges. Those processes which commonly require time-consuming human intervention, such as the preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple sequence alignments and the performance of high quality phylogenetic analysis to be necessary. 
We describe here our fully- automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages (PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly, this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts. Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS ONE 3(7): e2566. doi:10.1371/journal.pone.0002566 multiple alignment and phylogeny was deemed unfeasible. However, this we believe can compromise the value of the results. For example, the delineation of OTUs has also been automated via tools that do not make use of alignments or phylogenetic trees (e.g., Greengenes). This is usually done by carrying out pairwise comparisons of sequences and then clustering of sequences that have better than some cutoff threshold of similarity with each other). This approach can be powerful (and reasonably efficient) but it too has limitations. In particular, since multiple sequence alignments are not used, one cannot carry out standard phylogenetic analyses. In addition, without multiple sequence alignments one might end up comparing and contrasting different regions of a sequence depending on what it is paired with. The limitations of avoiding multiple sequence alignments and phylogenetic analysis are readily apparent in tools to classify sequences. 
For example, the Ribosomal Database Project’s Classifier program [29] focuses on composition characteristics of each sequence (e.g., oligonucleotide frequency) and assigns taxonomy based upon clustering genes by their composition. Though this is fast and completely automatable, it can be misled in cases where distantly related sequences have converged on similar composition, something known to be a major problem in ss-rRNA sequences [30]. Other taxonomy assignment systems focus primarily on the similarity of sequences. The simplest of these is classification tools it does have some limitations. For example, the generation of new alignments for each sequence is both computational costly, and does not take advantage of available curated alignments that make use of ss-RNA secondary structure to guide the primary sequence alignment. Perhaps most importantly however is that the tool is not fully automated. In addition, it does not generate multiple sequence alignments for all sequences in a dataset which would be necessary for doing many analyses. Automated methods for analyzing rRNA sequences are also available at the web sites for multiple rRNA centric databases, such as Greengenes and the Ribosomal Database Project (RDPII). Though these and other web sites offer diverse powerful tools, they do have some limitations. For example, not all provide multiple sequence alignments as output and few use phylogenetic approaches for taxonomy assignments or other analyses. More importantly, all provide only web-based interfaces and their integrated software, (e.g., alignment and taxonomy assignment), cannot be locally installed by the user. Therefore, the user cannot take advantage of the speed and computing power of parallel processing such as is available on linux clusters, or locally alter and potentially tailor these programs to their individual computing needs (Table 1). Given the limited automated tools that are available for Table 1. 
Comparison of STAP’s computational abilities relative to existing commonly-used ss-RNA analysis tools.

                                              STAP          ARB      Greengenes  RDP
    Installed where?                          Locally       Locally  Web only    Web only
    User interface                            Command line  GUI      Web portal  Web portal
    Parallel processing                       YES           NO       NO          NO
    Manual curation for taxonomy assignment   NO            YES      NO          NO
    Manual curation for alignment             NO            YES      NO*         NO
    Open source                               YES**         NO       NO          NO
    Processing speed                          Fast          Slow     Medium      Medium

It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is more amenable to downstream code manipulation.
* Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
** The STAP program itself is open source; the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001

Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001

STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project. 
Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which ‘‘positional homology’’ cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time- consuming, skillful, human intervention. We designed an automat- ed masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing. First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus ‘‘nucleotide’’ for each column (X) also has R dimensions, with the Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001 ss-rRNA Taxonomy Pipeline Dongying 
 Wu Amber Hartman Naomi Ward
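The STAP excerpt above mentions OTU delineation by pairwise comparison followed by clustering of sequences that exceed a similarity cutoff. A minimal sketch of that idea follows, assuming pre-aligned, equal-length sequences and a simple greedy, seed-based clustering; this is not STAP's own method (STAP builds alignments and trees), and the names `identity`, `greedy_otus`, and `TOY_ALIGNMENT` as well as the 0.94 cutoff are illustrative choices.

```python
# Toy aligned sequences taken from the phylotyping slides in this talk.
TOY_ALIGNMENT = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: dict[str, str], cutoff: float = 0.94) -> list[list[str]]:
    """Greedy clustering: each sequence joins the first OTU whose seed
    sequence it matches at or above the cutoff, else it seeds a new OTU.

    Assumption: input order matters, as in any greedy seed-based scheme.
    """
    otus: list[tuple[str, list[str]]] = []  # (seed sequence, member names)
    for name, seq in seqs.items():
        for seed, members in otus:
            if identity(seq, seed) >= cutoff:
                members.append(name)
                break
        else:
            # No existing seed was similar enough: start a new OTU.
            otus.append((seq, [name]))
    return [members for _, members in otus]


if __name__ == "__main__":
    for i, members in enumerate(greedy_otus(TOY_ALIGNMENT), start=1):
        print(f"OTU {i}: {', '.join(members)}")
```

As the excerpt notes, similarity cutoffs of this kind are fast but do not support standard phylogenetic analyses, which is the gap the tree-based pipelines aim to fill.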
  • 23. alignment used to build the profile, resulting in a multiple PD versus PID clustering, 2) to explore overlap between PhylOT Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generaliz workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001 Finding Metagenomic OTU Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi: 10.1371/journal.pcbi.1001061 PhylOTU Tom Sharpton Katie Pollard Jessica Green
  • 24. !24 rRNA PCR: Community Comparisons
  • 25. rRNA PCR: Community Comparisons. Isolate DNA; rRNA PCR. Groups: Archaea; Eukaryotes; Bacteria.
      Taxa  Characters
      B1    ACTGCACCTATCGTTCG
      B2    ACTCCACCTATCGTTCG
      E1    ACTCCAGCTATCGATCG
      E2    ACTCCAGGTATCGATCG
      A1    ACCCCAGCTCTCGCTCG
      A2    ACCCCAGCTCTGGCTCG
      New1  ACCCCAGCTCTGCCTCG
      New2  ACTGCACCTATCGTTCG
      New3  ACCCCAGCTCTCGCTCG
      New4  AGGGGAGCTCTCGCTCG
  • 26. rRNA PCR: Community Comparisons. Isolate DNA; rRNA PCR.
      Taxa  Characters
      B1    ACTGCACCTATCGTTCG
      B2    ACTCCACCTATCGTTCG
      E1    ACTCCAGCTATCGATCG
      E2    ACTCCAGGTATCGATCG
      A1    ACCCCAGCTCTCGCTCG
      A2    ACCCCAGCTCTGGCTCG
      New1  ACCCCAGCTCTGCCTCG
      New2  ACTGCACCTATCGTTCG
      New3  ACCCCAGCTCTCGCTCG
      New4  AGGGGAGCTCTCGCTCG
  • 27. Hartman et al. BMC Bioinformatics 2010, 11:317 http://www.biomedcentral.com/1471-2105/11/317 Open AccessSOFTWARE Software Introducing W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences Amber L Hartman†1,3, Sean Riddle†2, Timothy McPhillips2, Bertram Ludäscher2 and Jonathan A Eisen*1 Abstract Background: For more than two decades microbiologists have used a highly conserved microbial gene as a phylogenetic marker for bacteria and archaea. The small-subunit ribosomal RNA gene, also known as 16 S rRNA, is encoded by ribosomal DNA, 16 S rDNA, and has provided a powerful comparative tool to microbial ecologists. Over time, the microbial ecology field has matured from small-scale studies in a select number of environments to massive collections of sequence data that are paired with dozens of corresponding collection variables. As the complexity of data and tool sets have grown, the need for flexible automation and maintenance of the core processes of 16 S rDNA sequence analysis has increased correspondingly. Results: We present WATERS, an integrated approach for 16 S rDNA analysis that bundles a suite of publicly available 16 S rDNA analysis software tools into a single software package. The "toolkit" includes sequence alignment, chimera removal, OTU determination, taxonomy assignment, phylogentic tree construction as well as a host of ecological analysis and visualization tools. WATERS employs a flexible, collection-oriented 'workflow' approach using the open- source Kepler system as a platform. Conclusions: By packaging available software tools into a single automated workflow, WATERS simplifies 16 S rDNA analyses, especially for those without specialized bioinformatics, programming expertise. 
In addition, WATERS, like some of the newer comprehensive rRNA analysis tools, allows researchers to minimize the time dedicated to carrying out tedious informatics steps and to focus their attention instead on the biological interpretation of the results. One advantage of WATERS over other comprehensive tools is that the use of the Kepler workflow system facilitates result interpretation and reproducibility via a data provenance sub-system. Furthermore, new "actors" can be added to the workflow as desired and we see WATERS as an initial seed for a sizeable and growing repository of interoperable, easy- to-combine tools for asking increasingly complex microbial ecology questions. Background Microbial communities and how they are surveyed Microbial communities abound in nature and are crucial for the success and diversity of ecosystems. There is no end in sight to the number of biological questions that can be asked about microbial diversity on earth. From animal and human guts to open ocean surfaces and deep sea hydrothermal vents, to anaerobic mud swamps or boiling thermal pools, to the tops of the rainforest canopy and the frozen Antarctic tundra, the composition of microbial communities is a source of natural history, intellectual curiosity, and reservoir of environmental health [1]. Microbial communities are also mediators of insight into global warming processes [2,3], agricultural success [4], pathogenicity [5,6], and even human obesity [7,8]. In the mid-1980 s, researchers began to sequence ribo- somal RNAs from environmental samples in order to characterize the types of microbes present in those sam- ples, (e.g., [9,10]). 
This general approach was revolution- ized by the invention of the polymerase chain reaction (PCR), which made it relatively easy to clone and then * Correspondence: jaeisen@ucdavis.edu 1 Department of Medical Microbiology and Immunology and the Department of Evolution and Ecology, Genome Center, University of California Davis, One Shields Avenue, Davis, CA, 95616, USA † Contributed equally Full list of author information is available at the end of the article WATERS - Kepler Workflow for rRNA matics 2010, 11:317 .com/1471-2105/11/317 Page 2 of 14 genes for ribosomal RNA) in partic- ubunit ribosomal RNA (ss-rRNA). ed a large amount of previously l diversity [1,11-13]. Researchers all subunit rRNA gene not only ith which it can be PCR amplified, has variable and highly conserved to be universally distributed among nd it is useful for inferring phyloge- 4,15]. Since then, "cultivation-inde- " have brought a revolution to the by allowing scientists to study a mount of diversity in many different ments [16-18]. The general premise Figure 1 Overview of WATERS. Schema of WATERS where white boxes indicate "behind the scenes" analyses that are performed in WA- Align Check chimeras Cluster Build Tree Assign Taxonomy Tree w/ Taxonomy Diversity statistics & graphs Unifrac files Cytoscape network OTU table Hartman et al. BMC Bioinformatics 2010, 11:317 http://www.biomedcentral.com/1471-2105/11/317 Page 3 of 14 Motivations As outlined above, successfully processing microbial sequence collections is far from trivial. Each step is com- plex and usually requires significant bioinformatics expertise and time investment prior to the biological interpretation. In order to both increase efficiency and ensure that all best-practice tools are easily usable, we sought to create an "all-inclusive" method for performing all of these bioinformatics steps together in one package. 
To this end, we have built an automated, user-friendly, workflow-based system called WATERS: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences (Fig. 1). In addition to being automated and simple to use, because WATERS is executed in the Kepler scientific workflow system (Fig. 2) it also has the advan- tage that it keeps track of the data lineage and provenance of data products [23,24]. Automation The primary motivation in building WATERS was to minimize the technical, bioinformatics challenges that arise when performing DNA sequence clustering, phylo- genetic tree, and statistical analyses by automating the 16 S rDNA analysis workflow. We also hoped to exploit additional features that workflow-based approaches entail, such as optimized execution and data lineage tracking and browsing [23,25-27]. In the earlier days of 16 S rDNA analysis, simply knowing which microbes were present and whether they were biologically novel was a noteworthy achievement. It was reasonable and expected, therefore, to invest a large amount of time and effort to get to that list of microbes. But now that current efforts are significantly more advanced and often require com- parison of dozens of factors and variables with datasets of thousands of sequences, it is not practically feasible to process these large collections "by hand", and hugely inef- ficient if instead automated methods can be successfully employed. Broadening the user base A second motivation and perspective is that by minimiz- ing the technical difficulty of 16 S rDNA analysis through the use of WATERS, we aim to make the analysis of these datasets more widely available and allow individuals with Figure 2 Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input and output paths where the user declares the location of their input files and desired location for the results files. 
Each green box is an individual Kepler actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-clicking on any actor or connector allows it to be manipulated and re-arranged.
default is 97% and 99%), and they are also generated for every metadata variable comparison that the user includes.
Data pruning
To assist in troubleshooting and quality control, WATERS returns to the user three fasta files of sequences
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
[Figure 3 panel labels: PCA plot, axes "Percent variation explained"; tree clades: BACTEROIDETES BACTEROIDALES, DELTAPROTEOBACTERIA, ACTINOBACTERIA, VERRUCOMICROBIA, EPSILONPROTEOBACTERIA, FIRMICUTES CLOSTRIDIA CLOSTRIDIALES, GAMMAPROTEOBACTERIA, CYANOBACTERIA, ALPHAPROTEOBACTERIA, FUSOBACTERIA, FIRMICUTES BACILLI, FIRMICUTES MOLLICUTES]
Amber Hartman
  • 28. Tree from Woese. 1987. Microbiological Reviews 51:221 rRNA Not Perfect Nothing is Perfect
  • 29. rRNA Phylogeny Copy # Correction Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743 Steven Kembel Jessica Green Martin Wu
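The correction named on this slide can be sketched as follows. This is only a minimal illustration of the arithmetic, assuming the 16S copy number per genome is already known for every taxon; it is not the procedure of Kembel et al. (2012), and the taxon names, read counts and copy numbers are made up.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Turn 16S read counts into estimated relative cell abundances.

    read_counts and copy_numbers are dicts keyed by taxon name
    (hypothetical inputs, not taken from the slide).
    """
    # A genome with k rRNA operons contributes k reads per cell,
    # so divide each taxon's reads by its copy number.
    cells = {taxon: read_counts[taxon] / copy_numbers[taxon]
             for taxon in read_counts}
    total = sum(cells.values())
    # Renormalize so the estimates sum to 1.
    return {taxon: value / total for taxon, value in cells.items()}


# Hypothetical example: taxon_A dominates the reads only because it
# carries seven 16S copies per genome.
reads = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 1}
corrected = correct_for_copy_number(reads, copies)
print(corrected)
```

In this toy case the uncorrected read fractions (0.7 and 0.3) reverse once copy number is taken into account.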
  • 30. Tree Complications 1 rRNA rRNA rRNA ACUGC ACCUAU CGUUCG ACUCC AGCUAU CGAUCG ACCCC AGCUCU CGCUCG Taxa Characters S ACUGCACCUAUCGUUCG R ACUCCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG F ACUCCAGGUAUCGAUCG C ACCCCAGCUCUCGCUCG W ACCCCAGCUCUGGCUCG Taxa Characters S ACUGCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG C ACCCCAGCUCUCGCUCG Euks Bacteria Arch Isolate Ribosomes Arch
  • 31. Tree Complications 2 rRNA rRNA rRNA ACUGC ACCUAU CGUUCG ACUCC AGCUAU CGAUCG ACCCC AGCUCU CGCUCG Taxa Characters S ACUGCACCUAUCGUUCG R ACUCCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG F ACUCCAGGUAUCGAUCG C ACCCCAGCUCUCGCUCG W ACCCCAGCUCUGGCUCG Taxa Characters S ACUGCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG C ACCCCAGCUCUCGCUCG Euks Bacteria Arch Isolate Ribosomes Arch
  • 32. Tree Complications 3 rRNA rRNA rRNA ACUGC ACCUAU CGUUCG ACUCC AGCUAU CGAUCG ACCCC AGCUCU CGCUCG Taxa Characters S ACUGCACCUAUCGUUCG R ACUCCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG F ACUCCAGGUAUCGAUCG C ACCCCAGCUCUCGCUCG W ACCCCAGCUCUGGCUCG Taxa Characters S ACUGCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG C ACCCCAGCUCUCGCUCG Euks Bacteria Arch Isolate Ribosomes Arch
  • 33. Automated Accurate Genome Tree Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510 Jenna Lang Aaron Darling
  • 35. Metagenomics metagenomics ACUGC ACCUAU CGUUCG ACUCC AGCUAU CGAUCG ACCCC AGCUCU CGCUCG Taxa Characters S ACUGCACCUAUCGUUCG R ACUCCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG F ACUCCAGGUAUCGAUCG C ACCCCAGCUCUCGCUCG W ACCCCAGCUCUGGCUCG Taxa Characters S ACUGCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG C ACCCCAGCUCUCGCUCG Eukaryotes Bacteria Archaea
  • 36. inputs of fixed carbon or nitrogen from external sources. As with Leptospirillum group I, both Leptospirillum group II and III have the genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system contain formate hydrogenlyase complexes. These, in combination with carbon monoxide dehydrogenase, may be used for carbon fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type
Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell cartoons are shown within a biofilm that is attached to the surface of an acid mine drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.
articles NATURE | doi:10.1038/nature02340 | www.nature.com/nature © 2004 Nature Publishing Group
Metagenomics metagenomics ACUGC ACCUAU CGUUCG ACUCC AGCUAU CGAUCG ACCCC AGCUCU CGCUCG Taxa Characters S ACUGCACCUAUCGUUCG R ACUCCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG F ACUCCAGGUAUCGAUCG C ACCCCAGCUCUCGCUCG W ACCCCAGCUCUGGCUCG Taxa Characters S ACUGCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG C ACCCCAGCUCUCGCUCG
  • 37. Metagenomics metagenomics ACUGC ACCUAU CGUUCG ACUCC AGCUAU CGAUCG ACCCC AGCUCU CGCUCG Taxa Characters S ACUGCACCUAUCGUUCG R ACUCCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG F ACUCCAGGUAUCGAUCG C ACCCCAGCUCUCGCUCG W ACCCCAGCUCUGGCUCG Taxa Characters S ACUGCACCUAUCGUUCG E ACUCCAGCUAUCGAUCG C ACCCCAGCUCUCGCUCG
  • 38. Culture Independent “Metagenomics” DNA DNA DNA
Taxa   Characters
B1     ACTGCACCTATCGTTCG
B2     ACTCCACCTATCGTTCG
E1     ACTCCAGCTATCGATCG
E2     ACTCCAGGTATCGATCG
A1     ACCCCAGCTCTCGCTCG
A2     ACCCCAGCTCTGGCTCG
New1   ACCCCAGCTCTGCCTCG
New2   AGGGGAGCTCTGCCTCG
New3   ACTCCAGCTATCGATCG
New4   ACTGCACCTATCGTTCG
RecA RecA RecA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
sequences are not conserved at the nucleotide level [29]. As a result, the nr database does not actually contain many more protein marker sequences that can be used as references than those available from complete genome sequences.
Comparison of phylogeny-based and similarity-based phylotyping
Although our phylogeny-based phylotyping is fully automated, it still requires many more steps than, and is slower than, similarity based phylotyping methods such as MEGAN [30]. Is it worth the trouble? Similarity based phylotyping works by searching a query sequence against a reference database such as NCBI nr and deriving taxonomic information from the best matches or 'hits'. When species that are closely related to the query sequence exist in the reference database, similarity-based phylotyping can work well. However, if the reference database is a biased sample or if it contains no closely related species to the query, then the top hits returned could be misleading [31]. Furthermore, similarity-based methods require an arbitrary similarity cut-off value to define the top hits. Because individual bacterial genomes and proteins can evolve at very different rates, a universal cut-off that works under all conditions does not exist. As a result, the final results can be very subjective. 
In contrast, our tree-based bracketing algorithm places the query sequence within the context of a phylogenetic tree and only assigns it to a taxonomic level if that level has adequate sampling (see Materials and methods [below] for details of the algorithm). With the well sampled species Prochlorococcus marinus, for example, our method can distinguish closely related organisms and make taxonomic identifications at the species level. Our reanalysis of the Sargasso Sea data placed 672 sequences (3.6% of the total) within a P. marinus clade. On the other hand, for sparsely sampled clades such as Aquifex, assignments will be made only at the phylum level. Thus, our phylogeny-based analysis is less susceptible to data sampling bias than a similarity based approach, and it makes
Figure 3 Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.
[Figure 3 chart: y-axis, relative abundance (0 to 0.8); x-axis, taxonomic groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria; one series per marker: dnaG frr infC nusA pgk pyrG rplA rplB rplC rplD rplE rplF rplK rplL rplM rplN rplP rplS rplT rpmA rpoB rpsB rpsC rpsE rpsI rpsJ rpsK rpsM rpsS smpB tsf]
RpoB RpoB RpoB Rpl4 Rpl4 Rpl4 rRNA rRNA rRNA Hsp70 Hsp70 Hsp70 EFTu EFTu EFTu Many other genes better than rRNA
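The similarity-based approach discussed on this slide (take the best hit, subject to a cut-off) can be illustrated with the toy sequences listed under slide 38. The sketch below is an assumption-laden stand-in: it counts mismatches (Hamming distance) instead of running a real search tool, and the function names and the cut-off of 2 mismatches are hypothetical.

```python
# Toy reference and query sequences copied from slide 38.
REFERENCES = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}
QUERIES = {
    "New1": "ACCCCAGCTCTGCCTCG",
    "New2": "AGGGGAGCTCTGCCTCG",
    "New3": "ACTCCAGCTATCGATCG",
    "New4": "ACTGCACCTATCGTTCG",
}


def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_hit(query, references):
    """Return (reference name, mismatches) for the closest reference."""
    hits = ((name, hamming(query, seq)) for name, seq in references.items())
    return min(hits, key=lambda hit: hit[1])


def assign(query, references, max_mismatches):
    """Best-hit assignment with an arbitrary cut-off (None if too distant)."""
    name, mismatches = best_hit(query, references)
    return name if mismatches <= max_mismatches else None


for label, seq in QUERIES.items():
    print(label, best_hit(seq, REFERENCES), assign(seq, REFERENCES, 2))
```

New2 shows the weakness described in the excerpt: it still has a "best" hit, but a distant one, so whether it gets assigned depends entirely on the chosen cut-off.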
  • 40. Phylotyping w/ Protein Markers AMPHORA http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7 [Chart: relative abundance (0 to 0.8) of major taxonomic groups, Alphaproteobacteria through Unclassified bacteria, for each of the 31 protein markers, dnaG through tsf] Martin Wu
  • 41. GOS 1 GOS 2 GOS 3 GOS 4 GOS 5 Phylogenetic ID of Novel Lineages Dongying Wu Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
  • 42. Phylogenetic Diversity of Metagenomes
typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13). Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both a
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214 Jessica Green Steven Kembel Katie Pollard
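The two measures described in the FIG. 1 caption can be sketched on a toy tree. Each branch is reduced to a hypothetical record (length, number of descendant sequences from community A, number from community B); the tree, branch lengths and counts are invented for illustration. The weighted version assumes the usual raw (unnormalized) form, the sum over branches of length times |A_i/A_T - B_i/B_T|.

```python
def unweighted_unifrac(branches):
    """Fraction of branch length leading to only one of the two communities."""
    unique = sum(length for length, a, b in branches if (a > 0) != (b > 0))
    observed = sum(length for length, a, b in branches if a > 0 or b > 0)
    return unique / observed


def weighted_unifrac(branches, total_a, total_b):
    """Raw weighted UniFrac: branch lengths weighted by the difference in
    relative abundance of descendants from each community."""
    return sum(length * abs(a / total_a - b / total_b)
               for length, a, b in branches)


# Hypothetical tree ((a1:1, c1:1):0.5, (c2:1, c3:1):1.0, a2:2), where
# a* are sequences from community A and c* from community B.
BRANCHES = [
    (1.0, 1, 0),  # tip a1
    (1.0, 0, 1),  # tip c1
    (0.5, 1, 1),  # internal branch above a1 and c1 (shared)
    (1.0, 0, 1),  # tip c2
    (1.0, 0, 1),  # tip c3
    (1.0, 0, 2),  # internal branch above c2 and c3
    (2.0, 1, 0),  # tip a2
]

print(unweighted_unifrac(BRANCHES))
print(weighted_unifrac(BRANCHES, total_a=2, total_b=3))
```

Dividing by each community's total is the normalization the caption refers to: a branch whose descendants are split in proportion to the community sizes contributes nothing.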
  • 43. Phylosift Input Sequences; each input sequence scanned against both workflows: rRNA workflow, protein workflow; LAST fast candidate search (search input against references); hmmalign multiple alignment (profile HMMs used to align candidates to reference alignment); Infernal multiple alignment; <600 bp, >600 bp; parallel option; pplacer phylogenetic placement; Taxonomic Summaries; Sample Analysis & Comparison: Krona plots, Number of reads placed for each marker gene, Edge PCA, Tree visualization, Bayes factor tests. Aaron Darling @koadman Erik Matsen @ematsen Holly Bik @hollybik Guillaume Jospin @guillaumejospin Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243 Erik Lowe
  • 44. Edge PCA: Identify lineages that explain most variation among samples Edge PCA - Matsen and Evans 2013 Output: Edge PCA
  • 45. Using Phylogeny 2: Functional Prediction
  • 46. PHYLOGENETIC PREDICTION OF GENE FUNCTION. METHOD: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. [Diagram: EXAMPLE A and EXAMPLE B, with genes 1 to 6 and 1A to 3B from Species 1, Species 2 and Species 3; ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN); node labels "Duplication?" and "Duplication"; label "Ambiguous"] Based on Eisen, 1998 Genome Res 8: 163-167. Phylogenomics
  • 47. Overlaying Functions onto Tree Aquae Trepa Rat Fly Xenla Mouse Human Yeast Neucr Arath Borbu Synsp Neigo Thema Strpy Bacsu Ecoli Theaq Deira Chltr Spombe Yeast Yeast Spombe Mouse Human Arath Yeast Human Mouse Arath Strpy Bacsu Human Celeg Yeast Metth Borbu Aquae Synsp Deira Helpy mSaco Yeast Celeg Human MSH4 MSH5 MutS2 MutS1 MSH1 MSH3 MSH6 MSH2 Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
  • 48. Phylogenomics ~~ Phylotyping Eisen et al. 1992. J. Bact. 174: 3416
  • 49. Proteorhodopsin Functional Diversity Venter et al., Science 304: 66. 2004
  • 51. Functional Prediction from Metagenomes DNA DNA DNA Taxa Characters B1 ACTGCACCTATCGTTCG B2 ACTCCACCTATCGTTCG E1 ACTCCAGCTATCGATCG E2 ACTCCAGGTATCGATCG A1 ACCCCAGCTCTCGCTCG A2 ACCCCAGCTCTGGCTCG New1 ACCCCAGCTCTGCCTCG New2 AGGGGAGCTCTGCCTCG New3 ACTCCAGCTATCGATCG New4 ACTGCACCTATCGTTCG articles NATURE | doi:10.1038/nature02340 | www.nature.com/nature © 2004 Nature Publishing Group
  • 52. Phylogenetic Prediction of Function • Many powerful and automated similarity based methods for assigning genes to protein families • COGs • PFAM HMM searches • Some limitations of similarity based methods can be overcome by phylogenetic approaches • Automated methods now available • Sean Eddy • Steven Brenner • Kimmen Sjölander
  • 53. Phylogenetic Prediction of Function • Many powerful and automated similarity based methods for assigning genes to protein families • COGs • PFAM HMM searches • Some limitations of similarity based methods can be overcome by phylogenetic approaches • Automated methods now available • Sean Eddy • Steven Brenner • Kimmen Sjölander • But …
  • 54. Carboxydothermus hydrogenoformans • Isolated from a Russian hot spring • Thermophile (grows at 80°C) • Anaerobic • Grows very efficiently on CO (Carbon Monoxide) • Produces hydrogen gas • Low GC Gram positive (Firmicute) • Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65.)
  • 55. Homologs of Sporulation Genes Wu et al. 2005 PLoS Genetics 1: e65.
  • 56. Carboxydothermus sporulates Wu et al. 2005 PLoS Genetics 1: e65.
  • 57. Non-Homology Predictions: Phylogenetic Profiling • Step 1: Search all genes in organisms of interest against all other genomes • Ask: Yes or No, is each gene found in each other species • Cluster genes by distribution patterns (profiles)
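The profiling steps on this slide can be sketched in a few lines. The sketch assumes the homology searches have already been run and summarized as, for each gene, the set of genomes in which a homolog was found; gene and genome names are hypothetical. Grouping genes with identical presence/absence profiles is the simplest possible form of clustering and stands in for whatever clustering method an actual analysis would use.

```python
from collections import defaultdict


def cluster_by_profile(presence, genomes):
    """Group genes that share the same presence/absence profile.

    presence: dict mapping gene name -> set of genomes with a homolog.
    genomes:  ordered list of genome names defining the profile columns.
    Returns a dict mapping profile tuple (1 = present, 0 = absent)
    to the list of genes with that profile.
    """
    clusters = defaultdict(list)
    for gene, found_in in presence.items():
        # Yes/no answer for each genome, in a fixed column order.
        profile = tuple(int(genome in found_in) for genome in genomes)
        clusters[profile].append(gene)
    return dict(clusters)


# Hypothetical search results.
GENOMES = ["sporeformer_1", "sporeformer_2", "nonsporeformer_1"]
PRESENCE = {
    "gene_A": {"sporeformer_1", "sporeformer_2"},
    "gene_B": {"sporeformer_1", "sporeformer_2"},
    "gene_C": {"sporeformer_1", "sporeformer_2", "nonsporeformer_1"},
}

clusters = cluster_by_profile(PRESENCE, GENOMES)
for profile, genes in clusters.items():
    print(profile, genes)
```

Genes that share a profile with known members of a process (here, presence only in the hypothetical spore formers) become candidates for involvement in that process even without sequence similarity to known genes.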
  • 58. Sporulation Gene Profile Wu et al. 2005 PLoS Genetics 1: e65.
  • 59. B. subtilis new sporulation genes J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12 Bjorn Traag Richard Losick
  • 60. Phylogenetic Profiling for Metagenomics?
  • 61. Using Phylogeny 3: Linking Function and Phylogeny
  • 62. HiC Crosslinking & Sequencing Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2:e415 http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the synthetic microbial community are shown before and after filtering, along with the percent of total constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon, species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence   Alignment    % of Total   Filtered     % of aligned   Length      GC      #R.S.
Lac0       10,603,204   26.17%       10,269,562   96.85%         2,291,220   0.462   629
Lac1       145,718      0.36%        145,478      99.84%         13,413      0.386   3
Lac2       691,723      1.71%        665,825      96.26%         35,595      0.385   16
Lac        11,440,645   28.23%       11,080,865   96.86%         2,340,228   0.46    648
Ped        2,084,595    5.14%        2,022,870    97.04%         1,832,387   0.373   863
BL21       12,882,177   31.79%       2,676,458    20.78%         4,558,953   0.508   508
K12        9,693,726    23.92%       1,218,281    12.57%         4,686,137   0.507   568
E. coli    22,575,903   55.71%       3,894,739    17.25%         9,245,090   0.51    1076
Bur1       1,886,054    4.65%        1,797,745    95.32%         2,914,771   0.68    144
Bur2       2,536,569    6.26%        2,464,534    97.16%         3,809,201   0.672   225
Bur        4,422,623    10.91%       4,262,279    96.37%         6,723,972   0.68    369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is shown for read pairs mapping to each chromosome. For each read pair the minimum path length on the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded. 
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and plotted. E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1; (Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137) due to edge effects induced by BWA treating the sequence as a linear chromosome rather than circular. Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs associating each genomic replicon in the synthetic community is shown as a heat map (see color scale, blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21. reference assemblies of the members of our synthetic microbial community with the same alignment parameters as were used in the top ranked clustering (described above). We first Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend) with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded. Contig associations were normalized for variation in contig size. typically represent the reads and variant sites as a variant graph wherein variant sites are represented as nodes, and sequence reads define edges between variant sites observed in the same read (or read pair). 
We reasoned that variant graphs constructed from Hi-C data would have much greater connectivity (where connectivity is defined as the mean path length between randomly sampled variant positions) than graphs constructed from mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A), Chris Beitel @datscimed Aaron Darling @koadman
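The binning described in the Figure 1 caption of this slide (minimum path length on a circular chromosome, pairs closer than 1000 bp discarded, 100 equal bins over 2.5 Mb, values normalized to sum to 1) can be sketched as below. Function names and the example read-pair coordinates are hypothetical; only the chromosome length (4,686,137 bp for E. coli K12) comes from Table 1. Dropping pairs at or beyond 2.5 Mb is an assumption made so that every kept pair falls in a bin.

```python
def circular_distance(pos_a, pos_b, chrom_length):
    """Minimum path length between two positions on a circular chromosome."""
    direct = abs(pos_a - pos_b)
    return min(direct, chrom_length - direct)


def insert_distribution(read_pairs, chrom_length,
                        min_dist=1000, max_dist=2_500_000, n_bins=100):
    """Normalized histogram of Hi-C read-pair separations."""
    bin_width = max_dist / n_bins
    counts = [0] * n_bins
    for pos_a, pos_b in read_pairs:
        dist = circular_distance(pos_a, pos_b, chrom_length)
        # Discard very close pairs, and (assumption) pairs outside the range.
        if dist < min_dist or dist >= max_dist:
            continue
        counts[int(dist // bin_width)] += 1
    total = sum(counts)
    # Normalize so the bin values sum to 1 (leave zeros if nothing was kept).
    return [c / total for c in counts] if total else counts


K12_LENGTH = 4_686_137  # from Table 1
# Hypothetical mapped positions of four read pairs.
PAIRS = [
    (100, 4_686_000),        # 237 bp apart across the origin: discarded
    (0, 30_000),             # 30 kb apart
    (1_000_000, 1_010_000),  # 10 kb apart
    (10, 4_600_010),         # 86,137 bp apart across the origin
]
distribution = insert_distribution(PAIRS, K12_LENGTH)
print(distribution[:5])
```

Computing the distance on the circle, rather than on the linearized assembly, is what keeps pairs that span the linearization point from being counted as megabase-scale contacts.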
  • 63. Pink Berries PB-PSB1 (Purple sulfur bacteria) PB-SRB1 (Sulfate reducing bacteria) (sulfate) (sulfide) Wilbanks, E.G. et al (2014). Environmental Microbiology Lizzy Wilbanks @lizzywilbanks
  • 64. Long Reads Help, A Lot Hiseq & Miseq 100-250 bp Moleculo 2-20 kb Pacbio RSII 2-20 kb Micky Kertesz, Tim Blauwcamp Meredith Ashby Cheryl Heiner Illumina-based “synthetic long reads” Real-time single molecule sequencing (p4-c2, p5-c3) 295 Megabases 474 Megabases 61 Gigabases
  • 65. Using Phylogeny 4: Better Reference Data
  • 66. PhyEco Markers
Phylogenetic group      Genome Number   Gene Number   Marker Candidates
Archaea                 62              145415        106
Actinobacteria          63              267783        136
Alphaproteobacteria     94              347287        121
Betaproteobacteria      56              266362        311
Gammaproteobacteria     126             483632        118
Deltaproteobacteria     25              102115        206
Epsilonproteobacteria   18              33416         455
Bacteroides             25              71531         286
Chlamydiae              13              13823         560
Chloroflexi             10              33577         323
Cyanobacteria           36              124080        590
Firmicutes              106             312309        87
Spirochaetes            18              38832         176
Thermi                  5               14160         974
Thermotogae             9               17037         684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE 8(10): e77033. doi:10.1371/journal.pone.0077033
  • 67. Better Protein Families Representative Genomes Extract Protein Annotation All v. All BLAST Homology Clustering (MCL) SFams Align & Build HMMs HMMs Screen for Homologs New Genomes Extract Protein Annotation Figure 1 Sharpton et al. 2012. BMC Bioinformatics, 13(1), 264. A B C
  • 69. Microbial Dark Matter Part 2 • Ramunas Stepanauskas • Tanja Woyke • Jonathan Eisen • Duane Moser • Tullis Onstott
  • 71. Phylogeny Isn’t Everything .. Model Systems
  • 72. Wu et al. 2006 PLoS Biology 4: e188. Baumannia makes vitamins and cofactors Sulcia makes amino acids Simple Symbioses Wu et al. 2006 PLoS Biology 4: e188. Baumannia makes vitamins and cofactors Sulcia makes amino acids Phylogenetic Binning Nancy Moran Dongying Wu
  • 73. Drosophila microbiome w/ Kopp Lab Both natural surveys and laboratory experiments indicate that host diet plays a major role in shaping the Drosophila bacterial microbiome. Laboratory strains provide only a limited model of natural host–microbe interactions Jenna Lang Angus Chandler
  • 74. Rice Microbiome w/ Sundar Lab Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS
Supplementary Figures
Fig. S1 Map depicting soil collection locations for greenhouse experiment.
Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil
three separate rhizocompartments: the rhizosphere, rhizoplane, and endosphere (Fig. 1A). Because the root microbiome has been shown to correlate with the developmental stage of the plant (10), the root-associated microbial communities were sampled at 42 d (6 wk), when rice plants from all genotypes were well-established in the soil but still in their vegetative phase of growth. For our study, the rhizosphere compartment was com-
Fig. 1. Root-associated microbial communities are separable by rhizocompartment and soil type. (A) A representation of a rice root cross-section depicting the locations of the microbial communities sampled. (B) Within-sample diversity (α-diversity) measurements between rhizospheric compartments indicate a decreasing gradient in microbial diversity from the rhizosphere to the endosphere independent of soil type. Estimated species richness was calculated as e^Shannon_entropy. The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points. (C) PCoA using the WUF metric indicates that the largest separation between microbial communities is spatial proximity to the root (PCo 1) and the second largest source of variation is soil type (PCo 2). 
(D) Histograms of phyla abundances in each compartment and soil. B, bulk soil; E, endosphere; P, rhizoplane; S, rhizosphere; Sac, Sacramento. 2 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1414592112 gate the relationship between rice ge- icrobiome, domesticated rice varieties rated growing regions were tested. Six spanning two species within the Oryza 2 d in the greenhouse before sampling. a) cultivars M104, Nipponbare (both ties), IR50, and 93-11 (both indica va- gside two cultivars of African cultivated g7102 (Glab B) and TOg7267 (Glab E). ed that rice genotype accounted for ariation between microbial communities % of the variance, P < 0.001; Dataset f the variance, P < 0.066; Dataset S5H); ntations for clustering patterns of the nt on the first two axes of unconstrained ppendix, Fig. S10). We then used CAP ffect of rice genotype on the microbial g on rice cultivar and controlling for and technical factors, we found that ge- ice have a significant effect on root- mmunities (5.1%, P = 0.005, WUF, Fig. , UUF, SI Appendix, Fig. S11A). Ordi- AP analysis revealed clustering patterns only partially consistent with genetic F and UUF metrics. The two japonica her and the two O. glaberrima cultivars ver, the indica cultivars were split, with O. glaberrima cultivars and IR50 clus- cultivars. enotypic effect manifests in individual eparated the whole dataset to focus on vidually and conducted CAP analysis and technical factors. The rhizosphere eight sites were operated under two cultivation practices: organic cultivation and a more conventional cultivation practice termed “ecofarming” (see below). Because genotype explained the least variance in the greenhouse data, we limited the analysis to one cultivar, S102, a California temperate japonica variety that is widely cultivated by commercial growers and is closely related to M104 (26). 
Field samples were collected from vegetatively growing rice plants in flooded fields and the previously defined rhizocompartments were analyzed as before. Unfortunately, collection of bulk soil controls for the field experiment was not
Fig. 3. Host plant genotype significantly affects microbial communities in the rhizospheric compartments. (A) Ordination of CAP analysis using the WUF metric constrained to rice genotype. (B) Within-sample diversity measurements of rhizosphere samples of each cultivar grown in each soil. Estimated species richness was calculated as e^Shannon_entropy. The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points.
fields are too high to find representative soil that is unlikely to be affected by nearby plants. Amplification and sequencing of the field microbiome samples yielded 13,349,538 high-quality sequences (median: 54,069 reads per sample; range: 12,535–148,233 reads per sample; Dataset S13). The sequences were clustered into OTUs using the same criteria as the greenhouse experiment, yielding 222,691 microbial OTUs and 47,983 OTUs with counts >5 across the field dataset. We found that the microbial diversity of field rice plants is significantly influenced by the field site. α-Diversity measurements of the field rhizospheres indicated that the cultivation site significantly impacts microbial diversity (SI Appendix, Fig. S14A, P = 2.00E-16, ANOVA and Dataset S14). Unconstrained PCoA using both the WUF and UUF metrics showed that microbial communities separated by field site across the first axis (Fig. 4B, WUF and SI Appendix, Fig. S14B, UUF). 
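The figure captions above state that estimated species richness was calculated as e raised to the Shannon entropy. A minimal sketch of that calculation follows, assuming the entropy is taken with the natural logarithm over OTU relative abundances within one sample; the function names and OTU counts are hypothetical.

```python
import math


def shannon_entropy(otu_counts):
    """Shannon entropy (natural log) of OTU counts for one sample."""
    total = sum(otu_counts)
    return -sum((count / total) * math.log(count / total)
                for count in otu_counts if count > 0)


def estimated_richness(otu_counts):
    """Effective number of species: e raised to the Shannon entropy."""
    return math.exp(shannon_entropy(otu_counts))


even_sample = [10, 10, 10, 10]   # four equally abundant OTUs
skewed_sample = [37, 1, 1, 1]    # same four OTUs, one dominant
print(estimated_richness(even_sample))
print(estimated_richness(skewed_sample))
```

For a perfectly even sample the value equals the number of OTUs; dominance by one OTU pulls it toward 1, which is why it behaves as an "effective" richness.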
PERMANOVA agreed with the unconstrained PCoA in that field site explained the largest proportion of variance between the microbial communi- ties for field plants (30.4% of variance, P < 0.001, WUF, Dataset S5O and 26.6% of variance, P < 0.001, UUF, Dataset S5P). CAP analysis constrained to field site and controlled for rhizocom- partment, cultivation practice, and technical factors (sequencing batch and biological replicate) agreed with the PERMANOVA results in that the field site explains the largest proportion of variance between the root-associated microbial communities in field plants (27.3%, P = 0.005, WUF, SI Appendix, Fig. S15A and 28.9%, P = 0.005, UUF, SI Appendix, Fig. S15E), sug- gesting that geographical factors may shape root-associated microbial communities. Rhizospheric Compartmentalization Is Retained in Field Plants. Sim- ilar to the greenhouse plants, the rhizospheric microbiomes of field plants are distinguishable by compartment. α-Diversity of the field plants again showed that the rhizosphere had the highest microbial diversity, whereas the endosphere had the least S15). PCoA the WUF a compartmen Appendix, F separation i ond largest (20.76%, P UUF, Data biomes cons trolled for f agreed with variance bet compartmen and 10.9%, Taxonomi overall sim Chloroflexi, microbiota. endosphere Proteobacteri and Plancto distribution trend from t Appendix, Fi We again OTUs in the S16). We fo endosphere c representing Fig. S17). Th the genus A and Alphap terestingly, 1 found to b greenhouse OTUs were sisted of tax and Myxoco bidopsis roo Cultivation Pr The rice fiel practices, org tion called farming in th are all perm harvest fumi itself does si partments ov a significant the rhizocom indicating th affected diffe the rhizosph practice, with zospheres th Dataset S14) crobial comm tests; Datase practices are the WUF m S14D). PERFig. 4. 
Root-associated microbiomes from field-grown plants are separable by cultivation site, rhizospheric compartment, and cultivation practice. (A)
Variation w/in Plant
Cultivation Site Effects
Rice Genotype Effects
and mitochondrial) reads to analyze microbial abundance in the endosphere over time (Fig. 6A). Using this technique, we confirmed the sterility of seedling roots before transplantation. We found that microbial penetrance into the endosphere occurred at or before 24 h after transplantation and that the proportion of microbial reads to organellar reads increased over the first 2 wk after transplantation (Fig. 6A). To further support the evidence for microbiome acquisition within the first 24 h, we sampled root endospheric microbiomes from sterilely germinated seedlings before transplanting into Davis field soil as well as immediately after transplantation and 24 h after transplantation (SI Appendix, Fig. S24). The root endospheres of sterilely germinated seedlings, as well as seedlings transplanted into Davis field soil for 1 min, both had a very low percentage of microbial reads compared with organellar reads (0.22% and 0.71%), with the differences not statistically significant (P = 0.1, Wilcoxon test). As before, endospheric microbial abundance increased significantly, by >10-fold after 24 h in field soil (3.95%, P = 0.05, Wilcoxon test). We conclude that brief soil contact does not strongly increase the proportion of microbial reads, and therefore the increase in microbial reads at 24 h is indicative of endophyte acquisition within 1 d after transplantation. 
α-Diversity significantly varied by rhizocompartment (P < 2E-16; Dataset S23) and there was a significant interaction between rhizocompartment and collection time (P = 0.042; Dataset S23); however, when each rhizocompartment was analyzed individually, the bulk soil was the only compartment that showed […] (13 d) approach the endosphere and rhizoplane microbiome compositions for plants that have been grown in the greenhouse for 42 d.
There are slight shifts in the distribution of phyla over time; however, there are significant distinctions between the compartments starting as early as 24 h after transplantation into soil (Fig. 6D, SI Appendix, Figs. S24B and S26, and Dataset S24). Because each phylum consists of diverse OTUs that could exhibit very different behaviors during acquisition, we next examined the dynamics and colonization patterns of specific OTUs within the time-course experiment. The core set of 92 endosphere-enriched OTUs obtained from the previous greenhouse experiment (SI Appendix, Fig. S9C) was analyzed for relative abundances at different time points (Fig. 6E). Of the 92 core endosphere-enriched microbes present in the greenhouse experiment, 53 OTUs were detectable in the endosphere in the time-course experiment. The average abundance profile over time revealed a colonization pattern for the core endospheric microbiome. Relative abundance of the core endosphere-enriched microbiome peaks early (3 d) in the rhizosphere and then decreases back to a steady, low level for the remainder of the time points. Similarly, the rhizoplane profile shows an increase after 3 d with a peak at 8 d with a decline at 13 d. The endosphere generally follows the rhizoplane profile, except that relative abundance is still increasing at 13 d. These results suggest that the core endospheric microbes are first attracted to the rhizosphere and then locate to the rhizoplane, where they attach […]
Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11 modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119 across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes represent no particular scale.
Function x Genotype
[…] of magnitude greater than in any single plant species to date. Under controlled greenhouse conditions, the rhizocompartments described the largest source of variation in the microbial communities sampled (Dataset S5A). The pattern of separation between the microbial communities in each compartment is consistent with a spatial gradient from the bulk soil across the rhizosphere and rhizoplane into the endosphere (Fig. 1C). Similarly, microbial diversity patterns within samples hold the same pattern where there is a gradient in α-diversity from the rhizosphere to the endosphere (Fig. 1B). Enrichment and depletion of certain microbes across the rhizocompartments indicates that microbial colonization of rice roots is not a passive process and that plants have the ability to select for certain microbial consortia or that some microbes are better at filling the root colonizing niche.
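The edge rule stated in the Fig. 5 caption (connect two OTUs when their Pearson correlation is at least 0.6) can be sketched as follows. This is only an illustration of that rule under stated assumptions: the OTU names and abundance values are invented, the function names `pearson` and `coabundance_edges` are hypothetical, and the step that groups OTUs into modules is not covered because the excerpt does not describe it.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coabundance_edges(abundance, threshold=0.6):
    """Return (otu_a, otu_b, r) for every OTU pair with r >= threshold."""
    edges = []
    for otu_a, otu_b in combinations(sorted(abundance), 2):
        r = pearson(abundance[otu_a], abundance[otu_b])
        if r >= threshold:
            edges.append((otu_a, otu_b, r))
    return edges


# Invented abundance profiles across five samples (illustration only).
abundance = {
    "otu_methanogen": [1, 2, 3, 4, 5],
    "otu_syntroph": [2, 4, 5, 8, 11],
    "otu_other": [5, 3, 4, 1, 2],
}

edges = coabundance_edges(abundance)
for otu_a, otu_b, r in edges:
    print(f"{otu_a} -- {otu_b} (r = {r:.2f})")
```

Only positive correlations at or above the threshold produce edges here, which follows the caption's wording; whether negative correlations were handled separately in the original analysis is not stated in the excerpt.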
Similar to studies in Arabidopsis, we found that the relative abundance of Proteobacteria is increased in the endosphere compared with soil, and that the relative abundances of Acidobacteria and Gemmatimonadetes decrease from the soil to the endosphere (9–11), suggesting that the distribution of different bacterial phyla inside the roots might be similar for all land plants (Fig. 1D and Dataset S6). Under controlled greenhouse conditions, soil type described the second largest source of variation within the microbial communities of each sample. However, the soil source did not affect the pattern of separation between the rhizospheric compartments, suggesting that the rhizocompartments exert a recruitment effect on microbial consortia independent of the microbiome source.
By using differential OTU abundance analysis in the compartments, we observed that the rhizosphere serves an enrichment role for a subset of microbial OTUs relative to bulk soil (Fig. 2). Further, the majority of the OTUs enriched in the rhizosphere are simultaneously enriched in the rhizoplane and/or endosphere of rice roots (Fig. 2B and SI Appendix, Fig. S16B), consistent with a recruitment model in which factors produced by the root attract taxa that can colonize the endosphere. We found that the rhizoplane, although enriched for OTUs that are also […]
Time Series
  • 75. Acknowledgements DOE JGI Sloan GBMF NSF DHS DARPA Aaron Darling
 Lizzy Wilbanks Jenna Lang Russell Neches Rob Knight Jack Gilbert Tanja Woyke Rob Dunn Katie Pollard Jessica Green Darlene Cavalier Eddy Rubin Wendy Brown Dongying Wu Phil Hugenholtz DSMZ Sundar Srijak Bhatnagar David Coil Alex Alexiev Hannah Holland-Moritz Holly Bik John Zhang Holly Menninger Guillaume Jospin David Lang Cassie Ettinger Tim Harkins Jennifer Gardy Holly Ganz