RIDING THE DATA 
TIDAL WAVE IN 
MICROBIOLOGY 
Adina Howe 
germslab.org (Genomics and Environmental Research in Microbial Systems) 
Argonne National Laboratory / Michigan State University 
Iowa State University, Ag & Biosystems Engr (January) 
Slides available at www.slideshare.com/adinachuanghowe 
Future of Big Data, Lincoln, NE 11/6/2014
Microbes are critical 
Climate Change 
USGCRP 2009 
Energy Supply 
www.alutiiq.com 
Human & Animal 
Health 
http://guardianlv.com/
Global Food 
Security 
An understanding 
of microbial ecology
Understanding community dynamics
• Who is there?
• What are they doing?
• How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
• Collect samples
• Extract DNA
• Sequence DNA
• “Analyze” DNA to identify its content and origin
Taxonomy (e.g., pathogenic E. coli)
Function (e.g., degrades cellulose)
Cost of Sequencing
[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990-2012. Stein, Genome Biology, 2010]
E. coli genome: 4,500,000 bp ($4.5M, 1992)
Rapidly decreasing costs with NGS
[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990-2012, with Next Generation Sequencing highlighted. Stein, Genome Biology, 2010]
E. coli genome: 4,500,000 bp ($200, presently)
Effects of low cost sequencing…
First free-living bacterium sequenced for billions of dollars and years of analysis
A personal genome can be mapped in a few days for hundreds to a few thousand dollars
The experimental continuum 
Single Isolate 
Pure Culture 
Enrichment 
Mixed Cultures 
Natural systems
The era of big data in biology
[Figure: DNA sequencing (Mbp per $) and disk storage (Mb per $), log scale, by year, 1990-2012. Stein, Genome Biology, 2010]
NGS (shotgun) sequencing: doubling time 5 months
Computational hardware: doubling time 14 months
Sanger sequencing: doubling time 19 months
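To make the gap between these doubling times concrete, here is a minimal Python sketch that converts a doubling time into a fold increase over a given period. It assumes simple exponential growth at a constant doubling time, which is an idealization of the plotted trends; the function and variable names are illustrative and not from the talk.

```python
def fold_increase(months: float, doubling_time_months: float) -> float:
    """Fold increase in capacity per dollar after `months`,
    assuming a constant doubling time (an idealization)."""
    return 2.0 ** (months / doubling_time_months)


# Doubling times as quoted on the slide (Stein, Genome Biology, 2010).
DOUBLING_MONTHS = {
    "NGS (shotgun) sequencing": 5,
    "Computational hardware": 14,
    "Sanger sequencing": 19,
}


def compare(years: float) -> dict:
    """Fold increase for each technology over the same number of years."""
    months = 12 * years
    return {name: fold_increase(months, d) for name, d in DOUBLING_MONTHS.items()}


if __name__ == "__main__":
    for name, fold in compare(5).items():
        print(f"{name}: ~{fold:,.0f}x in 5 years")
```

Over five years, a 5-month doubling time compounds twelve times (2^12 = 4096-fold), while a 14-month doubling time compounds only about four times. Under this assumption, sequencing output per dollar outpaces storage per dollar by roughly two orders of magnitude in that window.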
Postdoc experience with data 
2003-2008 Cumulative sequencing in PhD = 2000 bp 
2008-2009 Postdoc Year 1 = 50 Gbp 
2009-2010 Postdoc Year 2 = 450 Gbp 
2014 = 50 Tbp 
2015 = 500 Tbp budgeted
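The jumps between these figures are easier to appreciate as fold changes. The sketch below does that arithmetic on the volumes listed on this slide; the labels and helper name are illustrative.

```python
GBP = 10**9   # base pairs in one Gbp
TBP = 10**12  # base pairs in one Tbp

# Sequencing volumes as listed on the slide, in base pairs.
VOLUMES_BP = [
    ("PhD, 2003-2008", 2000),
    ("Postdoc year 1", 50 * GBP),
    ("Postdoc year 2", 450 * GBP),
    ("2014", 50 * TBP),
    ("2015 (budgeted)", 500 * TBP),
]


def fold_changes(volumes):
    """Fold change of each entry relative to the entry before it."""
    return [
        (later[0], later[1] / earlier[1])
        for earlier, later in zip(volumes, volumes[1:])
    ]


if __name__ == "__main__":
    for label, fold in fold_changes(VOLUMES_BP):
        print(f"{label}: {fold:,.0f}x the previous volume")
```

The first step alone, from 2000 bp to 50 Gbp, is a 25-million-fold increase.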
THE DIRT ON SOIL 
MAGNIFICENT BIODIVERSITY 
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
THE DIRT ON SOIL 
SPATIAL HETEROGENEITY 
http://www.fao.org/ www.cnr.uidaho.edu
THE DIRT ON SOIL 
DYNAMIC
THE DIRT ON SOIL 
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES 
Philippot, 2013, Nature Reviews Microbiology
I. Technical side of microbial big data in 
biology 
II. Future of big data in soil microbial 
communities 
III. Bottlenecks for microbiologists
Tackling Soil Biodiversity 
Source: Chuck Haney 
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) 
Janet Jansson, Susannah Tringe (JGI)
Lesson #1: Accessing information in 
data 
http://siliconangle.com/files/2010/09/image_thumb69.png
de novo assembly
Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genomes
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Metagenome assembly…a scaling 
problem.
Shotgun sequencing and de novo 
assembly 
It was the Gest of times, it was the wor 
, it was the worst of timZs, it was the 
isdom, it was the age of foolisXness 
, it was the worVt of times, it was the 
mes, it was Ahe age of wisdom, it was th 
It was the best of times, it Gas the wor 
mes, it was the age of witdom, it was th 
isdom, it was tIe age of foolishness 
It was the best of times, it was the worst of times, it was the 
age of wisdom, it was the age of foolishness
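The idea of rebuilding a sentence from overlapping fragments can be sketched as a toy greedy overlap assembler in Python. This is a simplified illustration, not the method used in the cited work. It assumes error-free reads (the fragments on the slide contain deliberate errors, which a real assembler resolves by consensus), it uses a short phrase without repeats, and it compares every read against every other read. Production assemblers typically use k-mer graphs rather than this all-against-all search. All names are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.
    Returns 0 if no overlap of at least `min_len` characters exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_a, best_b = 0, None, None
        # All-against-all comparison: n * (n - 1) ordered pairs per round.
        for a, b in permutations(range(len(contigs)), 2):
            n = overlap(contigs[a], contigs[b], min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # nothing left to merge
        merged = contigs[best_a] + contigs[best_b][best_n:]
        contigs = [c for i, c in enumerate(contigs) if i not in (best_a, best_b)]
        contigs.append(merged)
    return contigs


def pairwise_comparisons(n_reads: int) -> int:
    """Ordered read pairs examined in one all-against-all round."""
    return n_reads * (n_reads - 1)


if __name__ == "__main__":
    reads = ["it was the age", "the age of wis", "of wisdom"]
    print(greedy_assemble(reads))
    print(pairwise_comparisons(len(reads)))
```

With three reads, one round examines only six ordered pairs, but the count grows quadratically with the number of reads. That growth is why naive comparison becomes impractical at metagenome scale.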
Practical Challenges – Intensive computing
Months of “computer crunching” on a supercomputer
Howe et al., 2014, PNAS
Assembly of 300 Gbp (70,000 genomes' worth) can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
50 Gbp = 10,000 genomes
Natural community characteristics
• Diverse
• Many organisms (genomes)
• Variable abundance
• Most abundant organisms, sampled more often
• Assembly requires a minimum amount of sampling
• More sequencing, more errors
Figure labels: Sample 1x; Sample 10x; Overkill
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
• Scales down datasets for assembly by up to 95%, with the same assembly outputs.
• Genomes, mRNA-seq, metagenomes (soils, gut, water)
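A minimal Python sketch of the digital normalization idea, loosely following the published description in Brown et al. (2012): a read is kept only if the median count of its k-mers, among the reads kept so far, is below a coverage cutoff. This illustration makes several simplifying assumptions. It uses an exact in-memory counter rather than a memory-efficient probabilistic one, it ignores reverse complements, and the values of `k` and `cutoff` are small ones chosen for the toy data, not recommended settings. Function names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int):
    """All overlapping substrings of length k in `seq`."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k: int = 4, cutoff: int = 2):
    """Keep a read only if its estimated coverage so far is below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers
    among the reads that have already been kept.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


if __name__ == "__main__":
    # Ten copies of an abundant sequence plus one rare sequence.
    reads = ["ACGTACGGTC"] * 10 + ["TTGACCATGA"]
    print(digital_normalization(reads))
```

In this toy input, the redundant copies of the abundant read are discarded once its coverage reaches the cutoff, while the rare read is retained. Reducing redundant data without losing low-abundance sequences is the behavior the slide describes.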
Tackling Soil Biodiversity 
Source: Chuck Haney 
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) 
Janet Jansson, Susannah Tringe (JGI)
The reality?
More like… 
Howe et al., 2014, PNAS Source: Chuck Haney
What we learned from deeply sequencing soil
• Grand Challenge effort – 10% of soil biodiversity sampled
• Incredible soil biodiversity (an estimated 10 Tbp/sample required)
• “To boldly go where no man has gone before”: >60% unknown
[Figure: total count of KOs (0-400) by functional category, split into corn and prairie, corn only, and prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility. Howe et al., 2014, PNAS]
Managed agricultural soils exhibit less diversity, likely because of their history of cultivation.
If soil is so diverse, what are the most consistent signals we can see at the plot level?
Is there an identifiable “soil 
functional core” (carbon cycling 
focus)? 
Ames, Iowa, COBS Field Site, Fertilized Prairie Whole Soil Samples 
4 deeply sampled whole soil metagenomes (16S rRNA: 5,000 reads; shotgun: 20-50 million reads)
Kirsten Hofmockel, Iowa State University
How much genetic sequence do you think is shared in 4 replicates?
• More than 1%? 10%? 50%?
• What kind of genes do you expect in this core?
• Minimal critical genes will be abundant & diverse
How much genetic sequence do you think is shared in 4 replicates?
• More than 1%? 10%? 50%?
• What kind of genes do you expect in this core?
• Minimal critical genes have varying abundances & are diverse
Core genes: soil-specific
My vision (and future research) 
Microbial markers for ecosystem services 
Nutrient cycling 
Pathogens 
Antibiotic resistance 
Biodiversity
Capabilities exist already…
Is more data better? 
Bottlenecks for the emerging 
microbiologists
Technical obstacles in the big data deluge
• Access to the data
• Access to the resources
Democratization of both data and resource access
“80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)
• Data volume and velocity
• Previous efforts are difficult to integrate
• Innovation is necessary
Data intensive microbiology 
Software Developers 
Computer Scientists 
Clinicians 
PIs 
Data generators 
Microbiologists 
Data Analyzers 
Statisticians 
Bioinformaticians 
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Social obstacles – the main challenge
A shift in costs does not mean a shift in expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than the time it took you to do your experiment to analyze the data. Please do not write me for results within 24 hours of your sequences becoming available.
- Adina
Culture of sharing 
Metagenomic Datasets 
http://www.heathershumaker.com/
Training / Incentives 
Emails between collaborators don’t contain as 
much “science” as I’d like:
All analysis: accessible, reproducible, and automated
To reproduce analysis in a publication,
1. Rent Amazon EC2 computer
2. Clone GitHub repository containing data and scripts
3. Open IPython notebook and execute
To run same analysis on different dataset,
1. Replace data files with your own data, execute notebook.
2. Tweak scripts as needed.
The journey in summary
RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
Acknowledgements
• C. Titus Brown (MSU)
• James Tiedje (MSU)
• Daina Ringus (UC)
• Folker Meyer (ANL)
• Eugene Chang (UC)
• NSF Biology Postdoc Fellowship
• DOE Great Lakes Bioenergy Research Center


Editor's Notes

  • #2: Thank organizers. Big data new, journey with big data. 3 facts, 1” = 1000 yr, Major nutrient for crop growth, 50 yr recovery, Phosphorus, sustainable ag can mitigate 0.3 – 1 ton of C/ha per year, 10% of all car carbon emissions
  • #3: Microbes are responsible for biogeochemical cycling of nutrients, providing benefits for plants for crop growth (especially in non-ideal conditions), and protection against pathogens and disease. Global population will require a 70-100% increase in agricultural yields by 2050. Soil is a black box. Little luck in the past figuring out some very simple questions….
  • #4: The questions we have in understanding microbes have not changed much…
  • #5: Historically, we have been asking these questions in model organisms. The challenge of model organisms…comparing them to what we know is in the environment…
  • #6: First automated DNA sequencing machines in the late 80s. New way of asking questions.
  • #7: Sequencing opened up the door to start building a catalog of some observed key microbial players. Driven by health and biotech interests. This is the same set of reference genes that are still in use today. Expensive.
  • #8: What changed the field was the invention of next generation technologies, basically allowing the throughput of these automated sequencers to be much higher and the cost of sequencing much cheaper. So cheap, in fact, that instead of sequencing only one bacterium you could start sequencing multiple, even bacteria from complex environments.
  • #9: Highlighted in recent news
  • #10: Opportunities and changes in the systems we study. So then the question is not only who is there and what they are doing? But what are they doing together and how?
  • #11: The growth – point out NGS impact. Accompanied by challenges of computation…even to store data on.
  • #12: Data during my career really reflects this growth. During postdoc, first year, 50 million reads to about 40x that within literally 9 months. Data increased 25 million times…. Notice the gap from 2010 – 2014, figuring stuff out.
  • #13: Why do we need such a volume of data? Most of us now recognize that microbial communities generally exhibit a high level of diversity, much higher than previously assumed from what was revealed by classical microscopy and basic culturing techniques. In soil, even in one gram of soil, there are estimated to be more microbial species than there are stars in the galaxy. We have far to go for any comprehensive characterization of any single soil community. A key question, then, is why is soil diversity so high?
  • #14: One reason may be that the soil structure provides unique niches that provide a high diversity of food resources. Its varied structure provides stable, protective, and even ancient environments for microorganisms.
  • #15: Soil investigations are further complicated by the primarily dormant state of the large majority of the soil microbial population. The turnover rate of soil microbes is predicted to be over 30 fold and even up to 300 fold slower than that of microbes in the oceans. And these microbes live in relatively unpredictable patterns of perturbations – for example, rainfall or leaf litter introduction. They also undergo defined temporal perturbations – diurnal energy input.
  • #16: This complexity in the soil has formed a dynamic microbial ecosystem which interacts with nutrients, plants, and the soil structure itself at multiple scales. I would argue that we as a field are still trying to find tractable methods of accessing these interactions and understanding the drivers of “healthy” or “productive” soils.
  • #17: Given soil's complexity
  • #18: Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters most carbon, produces a large amount of biomass annually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could glean now with the technologies.
  • #19: With growing volumes of data, the most obvious way to tractably access this data is to “smartly” reduce it.
  • #20: One genomic way to reduce data is a process known as assembly. Assembly has been around since the sequencing of single organisms.
  • #21: Metagenomics…a problem of scale
  • #22: Assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments. In this example, we are coming up with a solution of one sentence using 8 fragments. In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments. And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
  • #23: Even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at the time. And for my larger datasets, there was simply nothing I could do with them; they would essentially crash any available assembly program that existed. So I had to come up with a way to deal with all of this data, or essentially there were a handful of PIs who had just invested tens of thousands of dollars in a project where we couldn’t tractably handle the datasets.
  • #24: I’m going to tell you now a little bit about how we were able to do this, and there were actually two different strategies we had to combine. Point out how many genomes this is. 70000
  • #25: First, start thinking about what the natural characteristics of environmental communities are. Diverse: there are multiple genomes, and even potentially millions of species, in a sample.
  • #26: Variable abundance in nature: some are highly abundant, some are not.
  • #27: This diversity and distribution of abundances means that we are unevenly sampling strains in the environment. If we want to sample the rarest species….we need
  • #28: The strategy we came up with started from a question: can we find the minimal dataset that you need for assembly, discarding the reads from this overkill section?
  • #29: Lossy compression. From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
  • #30: As we sample more, we will have some sequences which will have errors in them.
  • #31: And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point where we’ll have enough sequences that we can make a good guess at what the original sequence may have looked like.
  • #32: For assembly, you need a minimum amount of information. So anything beyond this 6 is excessive or redundant information.
  • #33: So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
  • #34: We end up with the minimal dataset needed for an assembly, here in pink, and a redundant set of information which we have set aside. In setting aside these reads, here in red, we discard errors and improve assembly.
  • #35: Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters most carbon, produces a large amount of biomass annually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could glean now with the technologies.
  • #37: More than half (50-80%) of sequences are unknown in soil and the gut microbiome.
  • #38: Overall, many functions are shared between corn and prairie soils. Interestingly, prairie soils have many more unique functions (indicated here as blue bars) compared to unique functions in the corn (here green). This result may reflect the varying management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 y and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity.
  • #40: I took multiple samples from the same soil plot, and I wanted to know if there was an identifiable soil functional or even taxonomic core among all samples. In collaboration with a team at ISU, we extracted DNA from soils from a fertilized prairie plot in Iowa and performed whole genome shotgun random sequencing on the extracted DNA. I then took these sequences, processed them with the data analysis tools I’d developed, and then looked for sequences that were present in ALL four replicates and there at a minimal, fairly conservative abundance. Further, I focused this analysis specifically on sequences which were similar to known carbon cycling genes.
  • #43: Explain why this is important. For biological markers in the soil, we have very site specific microbes that are providing services.
  • #44: The challenge of this is determining what these markers are…big data and sequencing. Determining what these are across multiple ecosystems (a variety of soils and even water and air)…. A transition perhaps to breadth of sequencing in multiple samples rather than depth. Combine it with sensors and environmental quality measurements. Identify site-specific, function-specific, strain-specific markers for ecosystem services.
  • #46: Finally, as the last part of my perspective on big data, I wanted to talk about the theme of this workshop: is more data better? For me, the answer is always yes. I always want more data. This is largely attributable to the fact that I have a lot of experience working with this data and the resources to play with it. But that is not always the case. So what are the challenges of big data to the microbiologist or biologist?
  • #47: Syntactic incompatibility. The first 90% of bioinformatics: your IDs are different from my IDs. Semantic incompatibility. The second 90% of bioinformatics: what does “gene” mean in your database? Unstructured data.
  • #48: More challenging is the emerging role a microbiologist now has to fill and the changing teams we are now involved in. I’m asked to play all these roles in various projects I’m involved in. And definitely, I’m asked to communicate to people in all these roles and they are asked to communicate with me. This communication can be challenging.
  • #49: For example, if you asked us all to describe a tire swing building project, you’d undoubtedly get many varied descriptions.
  • #51: Communication and social obstacles are the most difficult.
  • #52: The need to share and participate in interdisciplinary research comes along with a culture of needing to demonstrate individual impact.
  • #54: Total reproducibility of all figures – one button Change the dataset, redo entire analysis on your own data
  • #55: Total reproducibility of all figures – one button Change the dataset, redo entire analysis on your own data
  • #57: Journey begins with TG 2008
  • #58: 6 years later…
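
The read-reduction idea in notes #29 to #34 (keep sampling a genome until there is enough information to assemble it, then set aside anything redundant) can be illustrated with a toy sketch. This is not the tool used in the talk; it is a minimal illustration under stated assumptions: coverage of a read is estimated as the median count of its k-mers among reads already kept, and the cutoff of 6 simply echoes the "6" mentioned in note #32. The function names, the k-mer size, and the example reads are all hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def reduce_reads(reads, k=4, cutoff=6):
    """Split reads into a kept set and a set-aside (redundant) set.

    Assumption for illustration: a read is redundant when the median
    count of its k-mers, among reads kept so far, has reached cutoff.
    """
    counts = Counter()
    kept, set_aside = [], []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            # Read is shorter than k, so coverage cannot be estimated: keep it.
            kept.append(read)
            continue
        estimated_coverage = median(counts[km] for km in ks)
        if estimated_coverage >= cutoff:
            set_aside.append(read)
        else:
            kept.append(read)
            # Count each distinct k-mer once per kept read.
            counts.update(set(ks))
    return kept, set_aside


# Hypothetical example: one abundant sequence sampled ten times, one rare
# sequence sampled once.
abundant = "ACGTTGCAAG"
rare = "TTTTGGGGCC"
kept, set_aside = reduce_reads([abundant] * 10 + [rare])
print(len(kept), len(set_aside))
```

Because only kept reads update the counts, the abundant sequence stops contributing once it reaches the cutoff, while the rare sequence is always retained. That is the uneven-sampling point from notes #26 and #27 in miniature.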