Trends in Genomics: An Engineer’s Perspective
Saul A. Kravitz, PhD
December 2009
Biggest Change: Sequencing is free
  2000: Factory, AB3700 @ Celera
    - 1k 500bp reads/day/sequencer = 0.5 Mbp/day
    - Human genome = ~190 sequencer-years, ~$200M
  2002: Factory, AB3730 @ JCVI
    - 10k 500bp reads/sequencer/day = 5 Mbp/day
    - Human genome = ~19 sequencer-years, ~$10M
  2010: Benchtop, 454 GS Junior
    - 70M 500bp reads/day = 35 Gbp/day
    - Human genome = ~1 sequencer-day, ~$10k
  2010: Service, Complete Genomics
    - Human genome = ~1 day, ~$1k
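
The throughput figures on this slide can be checked with a few lines of arithmetic. The sketch below is only illustrative: the ~35 Gbp target per human genome (about 12x coverage of a ~3 Gbp genome) is an assumption inferred from the slide's own numbers, not something the deck states, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the slide's throughput figures.
# ASSUMPTION: TARGET_BP (~35 Gbp of raw sequence per human genome) is
# inferred from the slide's numbers; it is not stated in the deck.

DAYS_PER_YEAR = 365
TARGET_BP = 35e9


def daily_yield_bp(reads_per_day: float, read_length_bp: int = 500) -> float:
    """Raw bases produced by one sequencer in one day."""
    return reads_per_day * read_length_bp


def sequencer_days(reads_per_day: float, read_length_bp: int = 500,
                   target_bp: float = TARGET_BP) -> float:
    """Sequencer-days needed to produce target_bp of raw sequence."""
    return target_bp / daily_yield_bp(reads_per_day, read_length_bp)


if __name__ == "__main__":
    # Reads per sequencer per day, as quoted on the slide.
    platforms = {
        "2000 AB3700": 1e3,
        "2002 AB3730": 1e4,
        "2010 benchtop": 70e6,
    }
    for name, reads in platforms.items():
        days = sequencer_days(reads)
        print(f"{name}: {daily_yield_bp(reads) / 1e6:,.1f} Mbp/day, "
              f"{days:,.0f} sequencer-days ({days / DAYS_PER_YEAR:,.1f} years)")
```

Under that assumed target, the three rows come out near 190 sequencer-years, 19 sequencer-years, and 1 sequencer-day, which matches the slide.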
New Bottlenecks
  - Generating sequence data – free
  - Data Management
  - Data Query
  - Data Analysis
  - Breadth: Communities
  - Depth: Populations (e.g., flu, human)
  - Thinking is very pricy!
Same Thinking $, More Data
  [Chart: Project Cost]
The Crux of the Problem
  - Genomic data interpreted in context
    - How does my genome compare to all others?
    - Which other proteins are similar to mine?
  - Size of context is growing exponentially
    - Growth is faster than Moore’s law
  - Hard to fight an exponential
    - BLASTP against NCBI NR
    - All against all BLASTP of microbial proteins
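
A rough way to see why "growth is faster than Moore's law" hurts is to compare the work with the available compute over time. The sketch below assumes, purely for illustration, that a single database search costs time proportional to database size and that an all-against-all comparison costs time proportional to its square. The doubling times are hypothetical parameters, not figures from the deck.

```python
# Toy model: how much longer does the same analysis take after some years,
# once hardware improvements are accounted for?
# ASSUMPTIONS: doubling times are hypothetical inputs; query cost is taken
# as linear in database size, all-against-all cost as quadratic.


def growth(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` for a given doubling time."""
    return 2.0 ** (years * 12.0 / doubling_months)


def relative_runtime(years: float,
                     data_doubling_months: float,
                     compute_doubling_months: float = 18.0,
                     all_vs_all: bool = False) -> float:
    """Runtime relative to today (1.0 means unchanged)."""
    data = growth(years, data_doubling_months)
    work = data ** 2 if all_vs_all else data
    return work / growth(years, compute_doubling_months)


if __name__ == "__main__":
    for years in (1, 3, 5):
        single = relative_runtime(years, data_doubling_months=12.0)
        pairs = relative_runtime(years, data_doubling_months=12.0,
                                 all_vs_all=True)
        print(f"after {years} yr: one query x{single:.1f}, "
              f"all-vs-all x{pairs:.1f}")
```

With data doubling every 12 months and compute every 18, a single query slows down steadily and the all-against-all case slows down much faster, which is the sense in which it is "hard to fight an exponential".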
Bioinformatics Isn’t High Energy Physics
  - Data inputs are changing rapidly
    - CE Chromatograms, 454 Flowgrams, Color Space
    - Error models and read lengths are changing rapidly
  - Tools evolving rapidly
    - Difficult to track many academic tools
    - High quality commercial platforms emerge
  - Even when “cooks” use shared “ingredients”, “recipes” vary widely
    - Faith based science
  - My dataset alone has limited value
  - Computations are (relatively) IO intensive
Some Solutions and Directions
  - Repeated process must be automated
    - Even if labor is free, deviations from SOP are costly
  - Commercial Tools
    - Market has expanded, quality improved
  - Tools for exploring Human Variation
    - The HuRef Browser
  - Metagenomics Tools and Challenges
    - Global Ocean Sampling Expedition
    - Visualization tools
    - Metagenomic Annotation
    - Genomic Standards Consortium and M5
  - Clouds and Grids
    - ScaaS: Science as a Service
Personal Genomics: The future is now (ca 2008)
HuRef Browser: Accelerate thinking
  - Compare 2 published genomes
    - Craig Venter’s Diploid Genome
    - Composite NCBI-36
  - Are differences real?
    - Noisy data?
    - Assembly errors?
    - Analysis errors?
  - Methods development requires curation by biologists
  - As genomes accumulate, more acute challenge
HuRef Browser: http://huref.jcvi.org
Zinc Finger Protein, Chr19:57564487-57581356
  [Browser screenshot. Track labels: Transcript, Gene, Haplotype Blocks, Variations, NCBI-36, Assembly-Assembly Mapping, HuRef, Assembly Structure]
Protein Truncated by 476 bp Insertion
  [Browser screenshot. Labels: Heterozygous SNP, Homozygous SNP, Insertion]
  [Browser screenshot. Labels: Assembly Structure, Insertion]
Genomics vs Metagenomics
  Genomics – ‘Old School’
    - Study of a single organism's genome
    - Genome sequence determined using shotgun sequencing and assembly
    - >1300 microbes sequenced, first in 1995 (at TIGR)
    - DNA usually obtained from pure cultures (<1%) or amplification of DNA from single cells
  Metagenomics
    - Use genomics tricks on communities – no culturing
    - Environmental shotgun sequencing of DNA or RNA
    - Metadata provides context
Metagenomic Questions
  - Within an environment
    - What biological functions are present (absent)?
    - What organisms are present (absent)?
  - Compare data from (dis)similar environments
    - What are the fundamental rules of microbial ecology?
    - Adapting to environmental conditions?
    - How do communities respond to stimuli?
    - How does community structure change?
  - Search for novel proteins and protein families
    - And diversity within known families
Global Ocean Sampling Expedition
  - 178 Total Sampling Locations
  - Pilot: 2.0M reads (4/04)
  - Phase 1: 7.7M reads, >6M proteins (3/07)
  - Phase 2-IO: 2.2M reads (3/08)
  - Phase 2: ~30M reads (2010?)
  - Diverse Environments: open ocean, estuary, embayment, upwelling, fringing reef, atoll…
GOS: Sequence Diversity in the Ocean
  Rusch et al. (PLoS Biology 2007)
  - Most sequence reads are unique
  - Very limited assembly
  - Most sequences not taxonomically anchored
  - Reference genomes a basis set? Not really.
    - Several hundred isolates
  - Challenges
    - Relating shotgun data to reference genomes
    - Structural and Functional Annotation
Browsing Large Data Collections: Fragment Recruitment Viewer
  - Microbial Communities vs Reference Genomes
    - Millions of sequence reads vs thousands of genomes
  - Definition: a read is recruited to a sequence if:
    - An end-to-end blastN alignment exists
  - Rapid Hypothesis Generation and Exploration
    - How do cultured and wildtype genomes differ?
    - Insertions, deletions, translocations
    - Correlation with environmental factors
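
The recruitment rule above can be sketched in code. This is a minimal illustration, not the viewer's implementation: the `Hit` record, its field names, and the `slack` tolerance are all hypothetical, and the alignments are assumed to have been computed already (for example by blastN) and loaded into memory. Keeping only the highest-identity hit per read is also an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """One read-vs-reference alignment (hypothetical record layout)."""
    read_id: str
    ref_id: str
    read_len: int            # full length of the read
    q_start: int             # 1-based inclusive alignment start on the read
    q_end: int               # 1-based inclusive alignment end on the read
    ref_start: int           # alignment start on the reference genome
    percent_identity: float


def is_end_to_end(hit: Hit, slack: int = 0) -> bool:
    """True if the alignment covers the read from its first to last base.

    `slack` (an assumption) tolerates a few unaligned bases at either end.
    """
    return hit.q_start <= 1 + slack and hit.q_end >= hit.read_len - slack


def recruit(hits: Iterable[Hit], slack: int = 0) -> Dict[str, Hit]:
    """Map each recruited read to its highest-identity end-to-end hit."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not is_end_to_end(hit, slack):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.percent_identity > current.percent_identity:
            best[hit.read_id] = hit
    return best
```

Each recruited hit then gives one point for a recruitment plot, with `ref_start` as the genomic position and `percent_identity` as the sequence similarity.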
Fragment Recruitment Viewer
  [Figure. Axes: Sequence Similarity vs Genomic Position. Credit: Doug Rusch, JCVI]
  [Figures. Credit: Doug Rusch and Michael Press]
GOS Protein Analysis
  Yooseph et al. (PLoS Biology 2007)
  - Novel clustering process
    - Sequence similarity based
    - Predict putative proteins and group into related clusters
    - Include GOS and all known proteins
  - Findings: GOS proteins

    - cover ~all existing prokaryotic families
    - expand diversity of known protein families
    - ~10% of large clusters are novel
    - Many are of viral origin
    - No saturation in the rate of novel protein family discovery
Added Protein Family Diversity
  Yooseph et al. (PLoS 2007)
  [Figure: Rubisco homologs. Legend: Known eukaryotes, Known prokaryotes, GOS prokaryotes, New Groups]
Annotation of Environmental Shotgun Data
  - Challenges:
    - Lack of context
    - Protein fragments
  - Gene Finding
    - Yooseph’s Protein Clusters + Metagene
  - Functional Assignment
    - Variation of JCVI prok annotation pipeline*
    - *Leverages protein cluster annotation -- soon
  - Result: Quality Nearly Comparable to Prokaryotic Genomic Annotation
Protein Clusters: Advantages and Disadvantages
  - Weaknesses
    - Homology-based
    - Stateful (also a strength)
    - Less sensitive (for now)
  - Strengths
    - Exponential → Linear?
    - Learns over time
    - Easy to maintain
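
One way to read "stateful", "learns over time", and "Exponential → Linear?" is that clusters persist between runs, so a new protein is compared against one representative per cluster rather than against every known protein. The sketch below illustrates that idea only; it is not the clustering process of Yooseph et al. The shared k-mer similarity is a toy stand-in for a real alignment score, and `ClusterStore` and its threshold are hypothetical.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (toy stand-in for alignment)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


class ClusterStore:
    """Keeps clusters between calls, so each new sequence is compared
    only against one representative per cluster."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.representatives: List[str] = []
        self.members: List[List[str]] = []

    def add(self, seq: str) -> int:
        """Assign seq to its most similar cluster, or start a new one.

        Returns the index of the cluster that received the sequence.
        """
        best_idx, best_score = -1, 0.0
        for idx, rep in enumerate(self.representatives):
            score = similarity(seq, rep)
            if score >= self.threshold and score > best_score:
                best_idx, best_score = idx, score
        if best_idx < 0:
            self.representatives.append(seq)
            self.members.append([seq])
            return len(self.representatives) - 1
        self.members[best_idx].append(seq)
        return best_idx
```

The cost of adding a sequence scales with the number of clusters rather than the number of proteins. The trade-off is the one the slide lists: a protein that resembles cluster members but not the representative can be missed, hence "less sensitive (for now)".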
Increasing the pressure
  - Nextgen + Metagenomics
    - Deeper collections
    - Short sequences → less informative
  - How should we annotate?
    - When in doubt, use BLAST against NRAA, and other large and fast-growing collections
  - Annotation needs growing dramatically
    - 24x7 quality software
    - Special Hardware: FPGA? Graphics/CUDA? SIMD/SSE?
    - New algorithms?
    - Back to supercomputers?
  - Sharing data and computes
    - Standardization of data, metadata, and computes
  Folker Meyer, ANL
Science as a Service (ScaaS)
  - Standard tools as services
  - Service-Oriented Architecture
  - Supported by HPC as necessary
  - Grid workflow for integration
  - Maintain tools & data in scalable compute environment
  - Celera Assembler in the clouds
Vision for High Throughput Science
  - Today: Scientist
  [Image: Construction of the Ark. Nuremberg Chronicle (1493).]
Vision for High Throughput Science
  - Engineers + Scientist
  [Images: http://freepages.genealogy.rootsweb.ancestry.com/~thegrove/gec2a.html; Rodin’s Thinker]
Credits
  - JCVI Informatics Team
  - Support: DOE, Gordon and Betty Moore Foundation, NIAID

Editor's Notes

  • #9: With the publication of the genomes of Craig Venter and Jim Watson, and with many additional human genomes being sequenced, the era of personal genomics is here. We are going to need really good tools to take advantage of this flood of data. My goal today is to share our experience building tools to understand the variation within a single individual’s genome, and to try to extrapolate forward to what we will need to understand larger collections of genomes.
  • #11:
      * A chromosome or sequence id followed by a start position and region length, e.g., "chr19:450000+100000" to display the region from 450000-550000 on chromosome 19.
      * A dbSNP id, e.g., "rs2691286"
      * An Ensembl annotation identifier, e.g., "ENSG00000104783"
      * A gene name, e.g., "KLKB1", optionally followed by the amount of flanking sequence to display, e.g., "KLKB1^2000"
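
The four query forms in this note are regular enough to parse with a few patterns. The sketch below is a hypothetical parser written from the examples above; it is not the HuRef Browser's code, and the `Query` record, the pattern details, and the order in which forms are tried are assumptions.

```python
import re
from typing import NamedTuple, Optional


class Query(NamedTuple):
    """Parsed search term (hypothetical record layout)."""
    kind: str                    # "region", "dbsnp", "ensembl" or "gene"
    name: str                    # sequence id, SNP id, Ensembl id or gene name
    start: Optional[int] = None  # region start, if kind == "region"
    end: Optional[int] = None    # region end, if kind == "region"
    flank: int = 0               # flanking bases, if kind == "gene"


_REGION = re.compile(r"^([\w.]+):(\d+)\+(\d+)$")   # id:start+length
_DBSNP = re.compile(r"^rs\d+$")                    # rs2691286
_ENSEMBL = re.compile(r"^ENS[A-Z]*\d+$")           # ENSG..., ENST...
_GENE = re.compile(r"^([^\^\s:]+)(?:\^(\d+))?$")   # NAME or NAME^flank


def parse_query(text: str) -> Query:
    """Classify a search term; raise ValueError if no form matches."""
    term = text.strip()
    match = _REGION.match(term)
    if match:
        start, length = int(match.group(2)), int(match.group(3))
        return Query("region", match.group(1), start, start + length)
    if _DBSNP.match(term):
        return Query("dbsnp", term)
    if _ENSEMBL.match(term):
        return Query("ensembl", term)
    match = _GENE.match(term)
    if match:
        return Query("gene", match.group(1), flank=int(match.group(2) or 0))
    raise ValueError(f"unrecognized query: {text!r}")
```

For example, `parse_query("chr19:450000+100000")` yields a region from 450000 to 550000 on chr19, matching the note's description.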
  • #12: Zinc Finger example whole transcript ENST00000334564
  • #13: Insert is 467 bp; truncates the protein. VNTR, probably heterozygous. Pink = non-synonymous; yellow = synonymous.