Complex metagenome
assembly, + bonus
career thoughts.
C. Titus Brown
UC Davis
ctbrown@ucdavis.edu
2015 aem-grs-keynote
Hello!
Research background:
Computing, modeling, and data analysis: 1989-2000
(high school & undergrad+)
Molecular biology, genomics, and data analysis:
2000-2007
(grad school + postdoc)
Bioinformatics, data analysis, and Comp Sci: 2007-
present
(assistant professor)
Genomics and Veterinary Medicine (?)
Two topics for this talk:
1. Metagenome assembly.
2. Careers & a “middle class” of
bioinformaticians.
Shotgun metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
Wikipedia: Environmental shotgun
sequencing.png
To assemble, or not to
assemble?
Goals: reconstruct phylogenetic content and predict
functional potential of ensemble.
Should we analyze short reads directly?
OR
Do we assemble short reads into longer contigs first, and
then analyze the contigs?
Assembly: good.
Howe et al., 2014, PMID 24632729
Assemblies yield much
more significant
homology matches.
But!
Does assembly work well!?
(Short reads, chimerism, strain
variation, coverage, compute
resources, etc. etc.)
Yes: metagenome assemblers
recover the majority of known
content from a mock community.
                             Velvet      IDBA        SPAdes
Total length (>= 0 bp)       1.6E+08     2.0E+08     2.0E+08
Total length (>= 1000 bp)    1.6E+08     1.9E+08     1.9E+08
Largest contig               561,449     979,948     1,387,918
# misassembled contigs       631         1032        752
Genome fraction (%)          72.949      90.969      90.424
Duplication ratio            1.004       1.007       1.004
Results: Dr. Sherine Awad
Reads from Shakya et al., 2013; PMID 23387867
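The length-based rows of this table can be recomputed from nothing more than a list of contig lengths. A minimal sketch, assuming the lengths are already in memory; `contig_summary` and the toy lengths are hypothetical, not from the talk. Genome fraction, duplication ratio, and misassembly counts need alignment against reference genomes and are not covered here.

```python
def contig_summary(lengths, min_len=1000):
    """Summarize an assembly from its contig lengths (in bp)."""
    lengths = list(lengths)
    return {
        # Sum over every contig, i.e. "Total length (>= 0 bp)".
        "total_length": sum(lengths),
        # Sum over contigs at least min_len long, i.e. "Total length (>= 1000 bp)".
        "total_length_min_len": sum(n for n in lengths if n >= min_len),
        # Length of the single longest contig (0 for an empty assembly).
        "largest_contig": max(lengths, default=0),
    }


# Toy contig lengths, made up for illustration.
summary = contig_summary([500, 1200, 30000, 999])
print(summary)
```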
But!
A study of the Rifle site comparing long read
(Moleculo/TruSeq) and short read/assembly content
concluded that their short read assembly was not
comprehensive.
“Low rate of read mapping (18-30%) is typically indicative of
complex communities with a large number of low
abundance genomes or with high degree of
species and strain variations.”
Sharon et al., Banfield lab; PMID 25665577
The dirty not-so-secret (?)
about sequence assembly:
The assembler will simply discard two types of data.
1. Low coverage data - can’t be reconstructed with
confidence; may be erroneous.
2. Highly polymorphic data – confuses the assembler.
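To make the first point concrete: k-mer based assemblers typically break reads into length-k substrings and set aside those seen too rarely, since a rare k-mer may simply be a sequencing error. A toy sketch of that kind of filtering, assuming a plain abundance cutoff; the function names, `k`, and the cutoff are illustrative and not taken from any particular assembler.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_abundance):
    """Keep only k-mers seen at least min_abundance times."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


# Three copies of one read (well covered) and a single copy of another.
reads = ["ATGGCA", "ATGGCA", "ATGGCA", "TTGACC"]
counts = count_kmers(reads, k=4)
kept = solid_kmers(counts, min_abundance=2)
dropped = set(counts) - kept  # the low-coverage read's k-mers end up here
print(sorted(kept), sorted(dropped))
```

The singly-covered read disappears entirely, whether it was an error or a real but rare genome.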
So: why didn’t the Rifle data
assemble?
There are no published approaches that will
discriminate between low coverage and strain
variation.
But we’ve known about this problem for ages.
So we’ve been working with something called
“assembly graphs”.
Assembly graphs.
Assembly “graphs” are a way of representing
sequencing data that retains all variation;
assemblers then crunch this down into a single
sequence.
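One common form of assembly graph is a De Bruijn graph, where nodes are (k-1)-mers and each k-mer in a read adds an edge. A minimal sketch, assuming error-free toy reads; `de_bruijn` and `branch_points` are hypothetical helper names. It shows how a single-base difference between two reads survives in the graph as a branch rather than being collapsed.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_points(graph):
    """Nodes with more than one successor, i.e. places where the reads disagree."""
    return {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}


# Two toy reads that differ at a single position (C vs. T).
reads = ["ACGTCAGG", "ACGTTAGG"]
graph = de_bruijn(reads, k=4)
print(branch_points(graph))
```

Both branches stay in the graph; it is the later "crunching down" step that picks one path or breaks the contig.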
Image from Iqbal et al., 2012. PMID 23172865
Our work on assembly graphs
enables:
Evaluation of data set coverage profiles prior to assembly.
Variant calling and quantification on raw metagenomic
data.
Analysis of strain variation.
Evaluation of “what’s in my reads but not in my assembly”.
(See http://ivory.idyll.org/blog/2015-wok-notes.html
for details.)
Rifle: Low coverage? (Yes.)
Assembly starts to work @ ~10x
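A back-of-the-envelope way to see why low-abundance genomes fall below that threshold: the expected coverage of one genome is roughly the bases sequenced from it divided by its length. A sketch, assuming relative abundance is expressed as the fraction of sequenced bases that come from that genome; every number and name below is hypothetical, and the 10x cutoff is only the rough figure quoted above.

```python
def expected_coverage(total_bases, fraction_of_bases, genome_size):
    """Approximate mean coverage of one genome within a metagenome."""
    return total_bases * fraction_of_bases / genome_size


ASSEMBLY_THRESHOLD = 10.0  # rough "assembly starts to work" coverage

total_bases = 10_000_000_000  # 10 Gbp of reads (made up)
genomes = {
    # name: (fraction of sequenced bases, genome size in bp), both made up
    "abundant": (0.05, 5_000_000),
    "rare": (0.001, 5_000_000),
}

coverage = {
    name: expected_coverage(total_bases, frac, size)
    for name, (frac, size) in genomes.items()
}
likely_to_assemble = {
    name: cov >= ASSEMBLY_THRESHOLD for name, cov in coverage.items()
}
print(coverage, likely_to_assemble)
```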
Rifle: strain variation? (Maybe.)
A typical subalignment between all short reads & one long read:
ATTCGTCGATTGGCAAAAGTTCTTTCCAGAGCCTACGGGAGAAGTGTA
|||||||||||||||||||||||||||||||| |||||||||||||||
ATTCGTCGATTGGCAAAAGTTCTTTCCAGAGCTTACGGGAGAAGTGTA
GTCAAAATAAGGTGAGGTTGCTAATCCTCGAACTTTTCAC
||||||||||||||| |||||||| ||||||| |||||||
GTCAAAATAAGGTGAAGTTGCTAACCCTCGAATTTTTCAC
If we saw many medium-high coverage alignments with this
level of variation, => strain variation.
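The level of variation in the second alignment on this slide can be quantified directly. A small sketch, assuming an ungapped alignment of equal-length sequences; `mismatches` and `percent_identity` are illustrative helpers, and `seq_a`/`seq_b` are just the two rows of that alignment (the slide does not say which row is the long read).

```python
def mismatches(a, b):
    """0-based positions where two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def percent_identity(a, b):
    """Percentage of aligned positions that are identical."""
    return 100.0 * (len(a) - len(mismatches(a, b))) / len(a)


seq_a = "GTCAAAATAAGGTGAGGTTGCTAATCCTCGAACTTTTCAC"
seq_b = "GTCAAAATAAGGTGAAGTTGCTAACCCTCGAATTTTTCAC"

positions = mismatches(seq_a, seq_b)
identity = percent_identity(seq_a, seq_b)
print(positions, identity)
```

Three differences in 40 bases is more than a single isolated error would produce, but at low coverage there are too few supporting reads to call it strain variation with confidence.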
My thoughts on metagenome
assembly & Rifle data:
The Rifle short-read data is low coverage, based
on both indirect (in paper) and direct (our)
observations. This is the first reason why it didn’t
assemble well.
Strain variation is also present, within the limits of
low coverage analysis. That will cause problems
in the future.
=> Your methods limit and bias your results.
The problem:
Assembly graphs are coming to all of
genomics.
Because they are fundamentally different
they require a completely new bioinformatics
tool chain. (They don’t use FASTA…)
For better or for worse, we bioinformaticians
are not going to write tools that are easy to
use.
It’s hard;
There’s little incentive;
The tool/application needs are incredibly
Who ya gonna call??
…to do your bioinformatics?
Choices
(1) Focus on biology and avoid computation as much as
possible.
(2) Integrate large scale data analysis into your biology.
(3) Become purely computational.
https://commons.wikimedia.org/wiki/File:Three_options_-_three_choices_scheme.png
Towards a “bioinformatics
middle class”
Most bioinformaticians are quite ignorant of the biology
you’re doing; biologists are often more aware of the
bioinformatics they’re using.
There is amazing opportunity at the intersection of
biology and computing.
I think of it as a “bioinformatics middle class” –
biologists who are comfortable with computing, and
deploy large scale data analysis in the service of their
biological work.
Towards a “bioinformatics
middle class”
We need many more biologists who have an
intuitive & deep understanding of the
computing.
Such people are rare, and there is no defined
“pipeline” for them. Training must be
self-motivated.
(And higher ed has really abdicated its
responsibilities in this area.)
My top four suggestions
(more at end)
1. Don’t avoid computing; embrace it.
2. Invest in the Internet and social media
(blogs, Twitter) – seqanswers, biostars, etc.
3. Be patient and aware of the time it takes
to effectively cross-train.
4. Seek out formal training opportunities.
If you’re a senior scientist, or
know any:
Ask them to lobby for funding at this
intersection.
Ask them to lobby for good (nay, excellent)
funding for training opportunities.
Make sure they respect the challenges and
opportunities of large scale data analysis and
modeling (along with those who do it).
Career benefits of doing
large-scale data analysis.
Alternative career paths (i.e. “jobs actually exist in this
area.”)
Flexibility in work hours & location.
Work with an even broader diversity of people and
projects.
Dangers:
It’s easy to get caught up in the computing and ignore
the biology!
…but right now training & culture are tilted too much
towards experimental and field research, which
presents its own problems in a data-intensive era of
research.
What’s coming?
Lots more data.
Where am I going?
Data integration.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
Figure via E. Kujawinski
An optimistic message
This is a great time to be alive and doing
research!
We can look at & try to understand
environmental microbes with many new tools
and new approaches!
The skills you need to do this extend across
disciplines, across the public and private
sectors, and cannot be automated or
outsourced!
Thank you for listening!
I’ll be here all week; really
looking forward to it!
More advice.
don’t avoid computing
teach and train what you do know; put together classes and
workshops;
host and run software and data carpentry workshops, then put
together more advanced workshops;
do hackathons or compute-focused events where you just sit
down in groups and work on data analysis.
(push admin to support all this, or just do it without your admin);
invest in the internet and social media – blogs, twitter, biostars…
take a CS prof to lunch, seek joint funding, do a sabbatical in a
purely compute lab, etc.
support open source bioinformatics software
invest in reproducibility
be aware that compute people’s time is as oversubscribed as
yours, or more so, & prospectively value it.



Editor's Notes

  • #2: Tweet; funding; affiliation.
  • #25: Nothing in this life that is worth doing is *easy*.