Data-intensive approaches to
investigating non-model
organisms
C. Titus Brown
ctb@msu.edu
Assistant Professor
Microbiology and Molecular Genetics; Computer Science and Engineering;
BEACON; Quantitative Biology Initiative
Outline
• My research!
• Opportunities for computational science training
• More unsolicited advice
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
Collaborators
• Jim Tiedje, MSU
• Erich Schwarz, Caltech / Cornell
• Paul Sternberg, Caltech
• Robin Gasser, U. Melbourne
• Weiming Li
• Hans Cheng
Funding
USDA NIFA; NSF IOS;
BEACON; NIH.
My interests
I work primarily on organisms of agricultural, evolutionary, or
ecological importance, which tend to have poor reference
genomes and transcriptomes. Focus on:
• Improving assembly sensitivity to better recover
genomic/transcriptomic sequence, often from “weird”
samples.
• Scaling sequence assembly approaches so that huge
assemblies are possible and big assemblies are
straightforward.
• “Better science through superior software”
There is quite a bit of life left to sequence & assemble.
http://pacelab.colorado.edu/
“Weird” biological samples:
• Single genome: hard to sequence DNA (e.g. GC/AT bias)
• Transcriptome: differential expression!
• High polymorphism data: multiple alleles
• Whole genome amplified: often extreme amplification bias
• Metagenome (mixed microbial community): differential abundance within community.
Single genome assembly is already
challenging --
Once you start sequencing
metagenomes…
DNA sequencing
• Observation of actual DNA sequence
• Counting of molecules
Image: Werner Van Belle
Fast, cheap, and easy to
generate.
Image: Werner Van Belle
New problem: data analysis &
integration!
• Once you can generate virtually any data set you want…
• …the next problem becomes finding your answer in the data
set!
• Think of it as a gigantic NSA treasure hunt: you know there are
terrorists out there, but to find them you have to hunt through 1 bn
phone calls a day…
“Heuristics”
• What do computers do when the answer is either really, really
hard to compute exactly, or actually impossible?
• They approximate! Or guess!
• The term “heuristic” refers to a guess, or shortcut
procedure, that usually returns a pretty good answer.
Often explicit or implicit tradeoffs between
compute “amount” and quality of result
http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
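The compute-versus-quality tradeoff can be sketched with a depth-limited game-tree search. This is a toy illustration, not a chess engine: the game (players alternate taking 1 to 3 stones, and whoever takes the last stone wins), the function name `negamax`, and the score of 0 at the depth cutoff are all assumptions made for this example.

```python
def negamax(stones, depth):
    """Score a position for the player to move: +1 win, -1 loss, 0 unknown.

    Toy game (an assumption for illustration): players alternate taking
    1-3 stones, and whoever takes the last stone wins.
    """
    if stones == 0:
        return -1  # the opponent took the last stone, so we have lost
    if depth == 0:
        return 0   # heuristic cutoff: compute budget spent, so guess "even"
    best = -1
    for take in (1, 2, 3):
        if take <= stones:
            # The opponent's score after our move, seen from our side.
            best = max(best, -negamax(stones - take, depth - 1))
    return best


# A deeper search costs more calls but replaces guesses with exact answers.
for depth in (1, 2, 8):
    print(depth, negamax(8, depth))
```

A shallow search is cheap but can only report "unknown" for hard positions; searching to the end of the game gives the exact result at a much higher cost.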
My actual research focus
What we do is think about ways to get computers to play chess
better, by:
• Identifying better ways to guess;
• Speeding up the guessing process;
• Improving people’s ability to use the chess-playing computer.
Now, replace “play chess” with
“analyze biological data”...
My actual research focus…
We build tools that help experimental biologists work efficiently
and correctly with large amounts of data, to help answer their
scientific questions.
This touches on many problems, including:
• Computational and scientific correctness.
• Computational efficiency.
• Cultural divides between experimental biologists and
computational scientists.
• Lack of training (biology and medical curricula devoid of math
and computing).
Not-so-secret sauce: “digital normalization”
• One primary step of one type of data analysis becomes 20-200x
faster, 20-150x “cheaper”.
Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you
need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
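The dilution arithmetic above can be checked in a few lines. This is a minimal sketch; the function name `required_coverage` and the dictionary layout are assumptions made for illustration.

```python
def required_coverage(abundance, target_on_rarest):
    """Coverage each component receives once the rarest one hits the target.

    `abundance` maps a component name to its relative abundance.
    """
    rarest = min(abundance.values())
    return {
        name: target_on_rarest * amount / rarest
        for name, amount in abundance.items()
    }


# A is 10 times more abundant than B; we want 10x coverage of B.
coverage = required_coverage({"A": 10, "B": 1}, target_on_rarest=10)
print(coverage)
```

Everything above the target on A is redundant data that still has to be stored and processed.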
Digital normalization
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
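One way to get these properties, sketched here as an assumption rather than as the implementation the talk describes, is to estimate each read's coverage from the median count of its k-mers seen so far and to keep the read only while that estimate is below a cutoff. The names `kmers` and `normalize_by_median`, the tiny `k`, and the exact dictionary counts are illustrative; a real tool would need a far more memory-efficient counting structure.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_by_median(reads, k=4, cutoff=2):
    """Single pass: keep a read only if its estimated coverage is low."""
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        # Median k-mer count approximates the coverage already collected.
        if median([counts[w] for w in words]) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


abundant = "ACGTTGCAAG"  # hypothetical high-coverage read
rare = "GGATCCTAGA"      # hypothetical low-coverage read
kept = normalize_by_median([abundant] * 10 + [rare])
print(kept)
```

Because discarded reads never update the counts, the k-mers unique to a discarded read (including any errors it carries) are never stored, while reads from low-coverage regions always pass the cutoff.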
http://en.wikipedia.org/wiki/JPEG
Lossy compression
Raw data (~10-100 GB) → Analysis → "Information" (~1 GB) → Database & integration
Restated:
Can we use lossy compression approaches to make
downstream analysis faster and better? (Yes.)
~2 GB – 2 TB of single-chassis RAM
Soil metagenome assembly
• Observation: 99% of microbes cannot easily be cultured in the
lab. (“The great plate count anomaly”)
• Many reasons why you can’t or don’t want to culture:
• Syntrophic relationships
• Niche-specificity or unknown physiology
• Dormant microbes
• Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common
ways to investigate microbial communities.
Investigating soil microbial ecology
• What ecosystem level functions are present, and how do
microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without shotgun
sequencing:
• What kind of strain-level heterogeneity is present in the
population?
• What does the phage and viral population look like?
• What species are where?
SAMPLING LOCATIONS
A “Grand Challenge” dataset (DOE/JGI)
[Bar chart: basepairs of sequencing (Gbp), axis 0 to 600, GAII and HiSeq, for eight soil samples: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultivated corn; Kansas, Native Prairie; Wisconsin, Continuous corn; Wisconsin, Native Prairie; Wisconsin, Restored Prairie; Wisconsin, Switchgrass]
For comparison:
• Rumen (Hess et al., 2011), 268 Gbp
• MetaHIT (Qin et al., 2011), 578 Gbp
• NCBI nr database, 37 Gbp
• Rumen K-mer Filtered, 111 Gbp
Total: 1,846 Gbp soil metagenome
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)

Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill

Adina Howe
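One reading of the "~1200 bacterial genomes" figure is the combined assembly size divided by a typical bacterial genome size. The sketch below assumes an average of 5 Mbp per bacterial genome; that value is not stated on the slide and is used only to show the arithmetic.

```python
assembly_bp = [2.5e9, 3.5e9]   # total assembly sizes from the table
AVG_BACTERIAL_GENOME_BP = 5e6  # assumed average; not given in the talk
HUMAN_GENOME_BP = 3e9          # "~3 billion bp" from the slide

total_bp = sum(assembly_bp)
bacterial_equivalents = total_bp / AVG_BACTERIAL_GENOME_BP
human_equivalents = total_bp / HUMAN_GENOME_BP
print(bacterial_equivalents, human_equivalents)
```

Under that assumption the two assemblies together come to about twice the size of the human genome.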
Strain variation?
[Plot: top two allele frequencies vs. position within contig]
Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
Can measure by read mapping.
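Measuring this from mapped reads can be sketched as follows. The input format (one string of observed bases per contig position), the function names, and the `min_minor` threshold that defines a polymorphic position are all assumptions for illustration; the slide does not say how the polymorphism rate was defined.

```python
from collections import Counter


def top_two_allele_frequencies(column):
    """Return the two largest base frequencies in one pileup column.

    `column` is a string of the bases that mapped reads show at a single
    contig position (a hypothetical input format for this sketch).
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return (0.0, 0.0)
    freqs = [n / depth for _, n in counts.most_common(2)]
    freqs += [0.0] * (2 - len(freqs))  # pad when only one allele is seen
    return tuple(freqs)


def polymorphism_rate(columns, min_minor=0.2):
    """Fraction of positions whose second allele reaches `min_minor`."""
    if not columns:
        return 0.0
    polymorphic = sum(
        1 for col in columns if top_two_allele_frequencies(col)[1] >= min_minor
    )
    return polymorphic / len(columns)


columns = ["AAAAAAAA", "CCCCCCTT", "GGGGGGGG", "TTTTTTTT"]
print([top_two_allele_frequencies(col) for col in columns])
print(polymorphism_rate(columns))
```

In practice a threshold like `min_minor` would need to sit above the sequencing error rate so that errors are not counted as alleles.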
Tentative observations from our
soil samples:
• We need 100x as much data…
• Much of our sample may consist of phage.
• Phylogeny varies more than functional predictions.
• We see little to no strain variation within our samples
• Not bulk soil --
• Very small, localized, and low coverage samples
• We may be able to do selective really deep sequencing and
then infer the rest from 16S.
• Implications for soil aggregate assembly?
I also work on…
• Genome assembly & analysis
• Transcriptome assembly and analysis
• Interpretation of annoying large data sets
What are the tissue-level changes in gene expression that support regeneration?
Transcriptome analysis of a regenerating vertebrate after SCI
brain
spinal cord
RNA-Seq to determine
differential expression
profile after injury
Sampling >weekly
-/+ Dex
Ona Bloom
Training opportunities
• PLB/MMG 810 (Shiu; ??)
• CSE 801/Intro BEACON course (Brown; FS ‘13)
“Intro to Computational Science for Evolutionary Biologists”
• CSE 801 bootcamp (late Sep)
• Software Carpentry bootcamp(s) (late Sep)
• Workshops in Applied Bioinformatics (Buell; ‘14?)
• Next-Gen Sequence Analysis Workshop (Brown; summer ‘14)
+ a variety of genomics courses that I can’t keep track of!
Becky Mansel will have these slides.
Unsolicited advice
Consider both faculty and non-faculty careers.
• It’s a bad time to be looking for faculty positions, and it’s a bad
time to be looking for funding; maybe this will improve in 10
years, maybe not.
• A PhD qualifies you for many, many more things than we will
(or can) tell you about!
• Specific advice:
• Network with industry folk; think beyond your advisor’s career.
• Write a blog: ivory.idyll.org/blog/advice-to-scientists-on-blogging.html

2013 bms-retreat-talk


Editor's Notes

  • #17 (Approach: Digital normalization): The goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC (overlap-layout-consensus) assembly.
  • #37 (Tentative observations from our soil samples): Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to, e.g., marine free-spawning animals.