Closing the Gap in Time: From Raw Data to Real Science
Science as a Service (ScaaS)
Justin H. Johnson
Director of Bioinformatics
EdgeBio
Gaithersburg, MD
Edge Bio & Me
Sequencing and Bioinformatics Shop
I am a
Bioinformatician
IT Professional
Scientist
The lines are blurred…
NGS Exponential Growth
Nature Biotechnology, Volume 26, Number 10, October 2008
Sequencing is Free?
Machines and Vendors
Lab Staffing, Integrations and LIMS
Project Design
Data Management and QC
Bioinformatics and Data Analysis
Data Computes and Storage
Data Sharing
Machines and Vendors
GnuBio
Ultra High Throughput + Lower Cost = Broader Applications
Bioinformatics Tools
* BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
* BWA - Heng Li's BWT alignment program, a progression from MAQ. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C source.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GenomeMapper - A short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. Created by the 1001 Genomes project. Source for POSIX.
* GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
* gnumap - The Genomic Next-generation Universal MAPper (gnumap) is designed to accurately map sequence data obtained from next-generation sequencing machines (specifically Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina, with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
* MOSAIK - Produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX.
* MrFAST and MrsFAST - Designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs; MrsFAST has a bisulphite mode. Authors are from the University of Washington. C source.
* MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* PASS - Supports Illumina, SOLiD and Roche-FLX data formats and allows the user to tune the sensitivity of the alignments very finely. Spaced seed initial filter, then NW dynamic algorithm to a SW(like) local alignment. Authors are from CRIBI in Italy. Win/Linux.
* RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
* Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
* SOAP - Short Oligonucleotide Alignment Program. A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
* SSAHA - Sequence Search and Alignment by Hashing Algorithm. A tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
* SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains SWIFT (fast local alignment search, guaranteeing to find epsilon-matches between two sequences) and SWIFT BALSAM (a very fast program to find semiglobal non-gapped alignments based on k-mer seeds). Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
* SXOligoSearch - A commercial platform offered by the Malaysia-based Synamatix. Will align Illumina reads against a range of RefSeq RNA or NCBI genome builds for a number of organisms. Web portal. OS independent.
* Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
* Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads, produced by next-generation sequencing technology, back to the reference genomes, and carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly, with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
Courtesy of SEQanswers.com
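Several of the aligners listed above (Bowtie, BWA, SOAP) are described as using a BWT index. As a rough illustration of the idea only, the Python sketch below builds a Burrows-Wheeler transform by sorting rotations and counts exact pattern occurrences with backward search. It is a toy under stated assumptions: the function names `bwt` and `count_occurrences` are made up for this example, the `$` sentinel is assumed not to occur in the text, and none of this reflects how any listed tool is actually implemented (real aligners use compressed indexes and tolerate mismatches).

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort every cyclic rotation; the transform is the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row[c]: number of characters in the text that sort before c,
    # i.e. the first sorted rotation that starts with c.
    first_row = {}
    for i, c in enumerate(sorted(bwt_str)):
        first_row.setdefault(c, i)
    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # Narrow the range by prepending c (the "LF mapping" step).
        lo = first_row[c] + bwt_str.count(c, 0, lo)
        hi = first_row[c] + bwt_str.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(bwt("banana"), "ana")` counts overlapping matches of "ana" without ever scanning the original text, which is the property that makes this kind of index attractive for mapping millions of short reads against one fixed reference.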
Experimental Design Considerations
Sequencing Platform in Use
Choice of Library Construction
Depth of coverage
Re$ources
Number of Replicates
Number of Samples and Control
Etc…
The Sky Isn’t Falling
It’s Building…
1
16
96
384
100,000
350,000,000
Tomorrow?
…and distributing
Distributing the Problem
Exchanging Data
Refreshing Data
How do we avoid the Perfect Storm?
http://www.flickr.com/photos/nationalmaritimemuseum/3115298137/
Life Vests?
Science as a Service
SaaS
IaaS
PaaS
HaaS
DaaS
Edge Bio
You can’t do this alone, neither can we.
Scientific Advisory Board
Elaine Mardis, Ph.D., Co-Director, Genome Sequencing Center, Washington University School of Medicine
Sam Levy, Ph.D., Director of Genomic Sciences, Professor of Translational Genomics & Human Genomic Medicine, Scripps Translational Science Institute, Scripps Health, Scripps Research Institute
Michael Zody, Ph.D., Chief Technologist, Broad Institute of MIT
Ken Dewar, Ph.D., Assistant Professor, McGill University and Quebec Genome
Steven Salzberg, Ph.D., Director, Center for Bioinformatics and Computational Biology, University of Maryland
Gabor Marth, Ph.D., Professor of Bioinformatics, Boston College
Elliott Margulies, Ph.D., Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health
EdgeBio
Edge Bio
Bioinformatics
Cloud Computing (IaaS, PaaS)
Amazon, Google, Others
NGS Software and Algorithms
Commercial (CLC) and Open Source
Frameworks
(cloud)Biolinux, Hadoop and Chef
Data Sharing and Standards (GSC/M5)
Cloud
Cloud
Not a talk on cloud computing…
Either you already know what it is – OR – others can do it more justice with more time.
The 4 dollar genome.
An Academic Problem
Have 100K+ free computes available (TeraGrid)
Can’t get to them easily
Have unlimited pool of computes (Amazon)
Can’t figure out how to pay for them
If you have figured those out…
Traditional Issues
Pipes too small
Security
Edge Science Gateway (ESG)
Easily leverage any source at the back end
Cloud
TeraGrid
Internal HPC Cluster
Internal cloud (Eucalyptus)
Commercial partners provide thin client layer
CLC Bio
Edge Science Gateway (ESG)
Plug and Play Tools
Chef and (cloud)Bio-Linux
Real people, not services underneath it all.
IFX and IT consulting
Edge Exome Analysis Pipeline
Base Call, Quality Filter
Map
Genome
Exon junctions
Variant Analysis
SNP
INDEL
Variant Annotation
RefSeq (etc)
dbSNP/1000 Genomes
SIFT, PolyPhen
OMIM
Repeat Database
Visualization
Functional and Structural Analysis
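The first stage of the pipeline above is a quality filter. A minimal Python sketch of one such filter follows, assuming reads arrive as (name, sequence, quality) tuples with Phred+33 encoded quality strings as in FASTQ. The encoding, the threshold of 20, and the function names are assumptions for illustration, not the actual Edge Bio pipeline (SOLiD colorspace data in particular uses different file formats).

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed by default)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def quality_filter(reads, min_mean_q=20):
    """Keep only reads whose mean Phred score is at least min_mean_q.

    reads: iterable of (name, sequence, quality) tuples.
    """
    return [
        (name, seq, qual)
        for name, seq, qual in reads
        if mean_phred(qual) >= min_mean_q
    ]
```

Keeping each stage a plain function from reads to reads is one way to get the plug-and-play behaviour the deck describes: a different filter or mapper can be swapped in without touching the stages around it.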
Sometimes money isn’t enough
Traditional Infrastructure or New Era Clouds
Still costs, whether up front or on back end
Always want to minimize cost
Frameworks and Algorithms R&D
NGS Software and Algorithms
cloud(Bio-Linux)
Hadoop
Chef
NGS Software and Algorithms
Best of Breed through pluggable architecture
Make them faster…
MapReduce BLAST & GPU-enabled BLAST
SIMD Accelerated HMMER, ClustalW, SW (CLC Bio)
…or run them less.
Clustering algorithms
Efficient data refreshes
Sharing of results
Frameworks
Cloud(BioLinux)
Easily deploy others’ software through virtualization
Chef (an Intro)
Open Source configuration management for your entire infrastructure
MPI and Hadoop
Speed up and I/O issue resolution with MPI
CloudBurst – MapReduce based NGS mapping
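To make "MapReduce based NGS mapping" concrete, here is a single-process Python sketch of the general pattern: a map step emits k-mers keyed by sequence, a shuffle groups them, and a reduce step pairs reference positions with read positions to propose alignment starts. This is only a toy illustration of the pattern. The function names, the seed length `K = 4`, and the in-memory shuffle are assumptions, and it is not CloudBurst's actual implementation, which runs on Hadoop and extends seeds into full alignments.

```python
from collections import defaultdict

K = 4  # seed length; an arbitrary choice for this sketch


def map_kmers(name, seq, is_ref):
    """Map step: emit (kmer, (is_ref, name, position)) for every k-mer."""
    for i in range(len(seq) - K + 1):
        yield seq[i:i + K], (is_ref, name, i)


def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seeds(values):
    """Reduce step: pair each reference hit with each read hit for one k-mer."""
    refs = [(n, p) for is_ref, n, p in values if is_ref]
    reads = [(n, p) for is_ref, n, p in values if not is_ref]
    for ref_name, ref_pos in refs:
        for read_name, read_pos in reads:
            # Offset of the read start on the reference implied by this seed.
            yield read_name, ref_name, ref_pos - read_pos


def seed_hits(reference, reads):
    """Run map, shuffle and reduce; return sorted candidate alignment starts."""
    pairs = list(map_kmers("ref", reference, True))
    for name, seq in reads.items():
        pairs.extend(map_kmers(name, seq, False))
    hits = set()
    for values in shuffle(pairs).values():
        hits.update(reduce_seeds(values))
    return sorted(hits)
```

Because each k-mer group is reduced independently, the work splits cleanly across machines, which is the reason this style of mapping fits a Hadoop cluster.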
Edge BioServ Services
Project goals and timelines
Number of samples
Number of reads/tags per sample
Experiment and Project Design
Sample Preparation
Library Preparation
Sample Capture
Library Construction
Sample QC and Quantification
Adaptor ligation
Project Workflow
Amplification & Sequence
Fragment amplification
Next Generation Sequencing run
Align sequence to reference genome or RNA database
Secondary and Tertiary Analysis
Data Analysis
Real World Examples
1500+ Sample Epigenetic Study
Challenges
Tracking
QC
Delivery
Solution
Mix of Commercial and Open Tools
Geospiza
CLC Bio
Custom Algorithms
HPC and Storage
Onsite 200 TB NAS
S3 for Backup (Soon for Delivery)
Real World Examples
Hundreds of exomes over the next 6 months.
Challenges
Pipeline for mapping and variant analysis
Solution
Mix of commercial, in house, and open source tools.

Edge Exome Analysis Pipeline
Base Call, Quality Filter
Map
Genome
Exon junctions
Variant Analysis
SNP
INDEL
Variant Annotation
RefSeq (etc)
dbSNP/1000 Genomes
SIFT, PolyPhen
OMIM
Repeat Database
Visualization
Functional and Structural Analysis
EdgeBio
Exon Coverage Statistics
How well is the exome covered?
~3 Quads on SOLiD 3 Plus
<1% of exons have 0 coverage
3% of exons have 0-5x coverage
5% of exons have 0-10x coverage
95% of exons have >10x coverage
~1 Quad on SOLiD 3 Plus
<2% of exons have 0 coverage
10% of exons have 0-5x coverage
17% of exons have 0-10x coverage
83% of exons have >10x coverage
Real World Examples
Transcriptomes from microbes to mice.
Challenges
Rapidly evolving field
Visualization
Solutions
Again leveraging tools across the board via ESG
Plug and Play Best of Breed.
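The exon coverage statistics reported above (share of exons at 0, 0-5x, 0-10x and >10x) can be tallied with a few lines of Python. This is a sketch under stated assumptions: it takes one depth value per exon (for instance a mean depth, which is a guess at how the slide's numbers were derived), treats the 0-5x and 0-10x buckets as cumulative, and uses a made-up function name.

```python
def coverage_summary(exon_depths):
    """Percentage of exons in each coverage bucket.

    exon_depths: one depth value per exon (e.g. mean read depth).
    The 0-5x and 0-10x buckets are cumulative, so 0-10x and >10x
    together always account for every exon.
    """
    depths = list(exon_depths)
    if not depths:
        raise ValueError("need at least one exon")

    def pct(predicate):
        return 100.0 * sum(1 for d in depths if predicate(d)) / len(depths)

    return {
        "0x": pct(lambda d: d == 0),
        "0-5x": pct(lambda d: d <= 5),
        "0-10x": pct(lambda d: d <= 10),
        ">10x": pct(lambda d: d > 10),
    }
```

Comparing the summary for one quad against three quads of the same sample is a quick way to see how much extra sequencing moves exons out of the low-coverage buckets.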
Questions?
Thank You
Twitter: @BioInfo & @EdgeBio
Email: jjohnson@edgebio.com

Editor's Notes

  • #3: Won’t bore you with the entire history of Edge Bio. It’s been around 20 … I am sure there are several of you in this room who found yourselves doing something you didn’t expect, because NGS has posed new challenges to typical thinking and infrastructure.
  • #4: I am sure you have all seen some variation of this. You, like us, see the value in exploring the next gen landscape, but with such an inflection, something has to give. Figure 1: The number of publications with keywords for nucleic acid detection and sequencing technologies. PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) was searched in two-year increments for key words and the number of hits plotted over time. For 2007–2008, results from January 1–March 31, 2008 were multiplied by four and added to those for 2007. Key words used were those listed in the legend except for new sequencing technologies (‘next-generation sequencing’ or ‘high-throughput sequencing’), ChIP (‘chromatin immunoprecipitation’ or ‘ChIP-Chip’ or ‘ChIP-PCR’ or ‘ChIP-Seq’), qPCR (TaqMan or qPCR or ‘real-time PCR’) and SNP analysis (SNPs or ‘single-nucleotide polymorphisms’ and not nitroprusside (nitroprusside is excluded because sodium nitroprusside is sometimes abbreviated as ‘SNP’ but is generally unrelated to genetics)).
  • #5: From the base of the inflection (or tipping point) in ’06, innovation and technology have been in a constant state of flux. Costs per base have plummeted. Still, it remains a significant capital expenditure in many areas.
  • #6: Let’s hone in on a couple of those areas. These are the players in the sequencing field as it is, and as it is emerging. Where do you start? Each provides data at lower costs. Each has its strengths and weaknesses. How do you make sure you choose the right platform for your application?
  • #7: Now you have a platform. With the high throughput comes an immense range of applications. How do you become proficient?
  • #8: Now you have an application, how do you design your project? And, what many here are talking about today, how do you effectively analyze the data? To illustrate the complexity, this is a partial list of read mapping software from SeqAnswers. So testing, comparing, and honing in on an app is important. Then each one can be run in a multitude of ways: how many errors, colorspace, base space, etc. Daunting.
  • #9: Then there is the execution and operations of the projects. We all wish we had a bag of money to do the science “in the right way”. Compromises are made to suit funders, timelines, quality standards, etc. How do we balance those limited resources? Let’s touch on that later.
  • #10: I tend to start with the inherent challenges of the next gen landscape to lay the ground work for possible solutions.
  • #11: I was lucky to grow up in genomics. 16 was high throughput. Infrastructure as we know it was rendered obsolete: sheer scale, chemistries (flows and colorspace data), error models. Significant issues lie ahead for next-next-gen in these areas.
  • #12: Distributing the problem: lower barrier to entry than before; it has somewhat democratized sequencing.
  • #14: Service-based models are becoming common in IT. They allow abstractions, so you don’t have to care about the hardware, or the software, or the platform that is built to accomplish the goals of the research organization or company.
  • #15: Edge abstracts it to the most basic of levels. Our aim is to help researchers with all these considerations, to come out with optimal results. DNA/RNA in, answers out.
  • #16: Thought leaders in the fields of genomics and bioinformatics.
  • #17: Let’s focus a bit more. Bioinformatics is a piece of the puzzle.
  • #18: More recently, it just so happens to be a larger piece.
  • #19: With limited time and so many cool things to talk about – I am going to touch on several. I would love to continue any one of these in a hallway or over coffee. My group’s goal is to leverage many technologies – commercial, academic, and internal to help our clients. Some of the areas we are focusing on are on this slide.
  • #20: So the first, and probably hottest, topic out there is cloud.
  • #21: Scalable; pay as you go; redundant data storage; ubiquitous access; lower upfront and capital costs.
  • #22: Access resources in non-traditional ways. A paradigm shift needs to happen with funding agencies, which think in traditional ways of funding capital infrastructure. We need to continue to educate.
  • #23: One needs to be nimble in the ability to leverage computes from many pools. With the ESG, we have built a flexible and scalable infrastructure.
  • #24: ESG is built around this plug-and-play architecture. We continue to explore tools: Chef and other virtualization mechanisms that ease the burden of infrastructure maintenance.
  • #25: Note how each is a node in which we can hot-swap tools as needed.
  • #26: This isn’t a panacea, though. There are costs, whether front or back loaded. So how do we minimize them? Spend time making things better, not just building pipelines. I’ll touch a bit more on each of these in the upcoming slides.
  • #27: Best of Breed. The field is changing so rapidly, so whether we or someone else figures out a way to do something better, we can take advantage. With software, make it better or run it less.
  • #28: How many times have you tried to install software that was sent to you, or that you downloaded from SourceForge, that wouldn’t build or run as planned? We quickly learned that a community-based AMI is very serial in its adoption and addition. Chef allows developers to construct infrastructure much like software.
  • #29: So, let’s wind this down by looking at some examples. I guess if you take one term away, I would like it to be Science as a Service. This is a concept, an abstraction for a method of building infrastructure and transparency. So, does it work?
  • #32: Note how each is a node in which we can plug in tools as needed.
  • #33: Balance between cost and coverage – this provides a talking point for a slide coming up in a few.
  • #36: What is the best way to do this with what you have? If money is not an object, get high coverage and 100% accuracy. In reality, what is the best way to do it with what you have? Platform; Bfx.
  • #40: What is the best way to do this with what you have? If money is not an object, get high coverage and 100% accuracy. In reality, what is the best way to do it with what you have? Platform; Bfx.
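
The notes on slides #24, #25 and #32 describe a plug-and-play design in which each pipeline step is a node whose tool can be swapped. The talk does not show how the ESG implements this, so the following Python sketch is only one way the idea could look: every name in it (`run_pipeline`, `mapper_a`, `mapper_b`, `call_variants`) is hypothetical, and the "tools" are stand-in functions rather than real aligners or variant callers.

```python
from typing import Callable, Dict, List

# A stage takes the output of the previous stage and returns its own output.
Stage = Callable[[str], str]


def run_pipeline(order: List[str], tools: Dict[str, Stage], data: str) -> str:
    """Run each named node in order, using whichever tool is plugged in."""
    for node in order:
        data = tools[node](data)
    return data


# Hypothetical stand-ins for real tools; they only label the data.
def mapper_a(data: str) -> str:
    return data + " -> mapped(A)"


def mapper_b(data: str) -> str:
    return data + " -> mapped(B)"


def call_variants(data: str) -> str:
    return data + " -> variants"


order = ["map", "variant_analysis"]
tools = {"map": mapper_a, "variant_analysis": call_variants}
first_run = run_pipeline(order, tools, "reads")

# Swapping the tool behind one node leaves the rest of the pipeline untouched.
tools["map"] = mapper_b
second_run = run_pipeline(order, tools, "reads")

print(first_run)
print(second_run)
```

Keeping the node order separate from the tool behind each node is what lets a better mapper or caller be dropped in without rebuilding the pipeline.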