Enabling Large Scale Sequencing Studies through Science as a Service (ScaaS)
Justin H. Johnson
Director of Bioinformatics
EdgeBio
Washington DC, USA
Agenda
Who We Are
NGS at 30K
Challenges and Enabling Through ScaaS
Transcriptome Projects
Exome Projects
Ion Torrent Data
Life Tech Service Provider
Contract Research Division
Five SOLiD 4 sequencing platforms
One Life Technologies 5500xl
Two Ion Torrent PGMs
Automation through Caliper Sciclone and Biomek FX
Life Technologies Preferred Service Provider
Agilent Certified Service Provider
Commercial partnerships with companies such as CLC Bio, DNAnexus and Genologics
MD/PhD and Masters-level scientists and bioinformaticians
IT infrastructure of >100 CPUs and >100 TB storage
Edge BioServ Scientific Advisory Board
Elaine Mardis, Ph.D., Co-Director, Genome Sequencing Center, Washington University School of Medicine
Sam Levy, Ph.D., Director of Genome Sciences, Scripps Translational Science Institute, Scripps Genomic Medicine
Michael Zody, M.S., Chief Technologist, Broad Institute
Ken Dewar, Ph.D., Assistant Professor, McGill University and Genome Quebec
Steven Salzberg, Ph.D., Director, Center for Bioinformatics and Computational Biology, University of Maryland
Gabor Marth, Ph.D., Professor of Bioinformatics, Boston College
Elliott Margulies, Ph.D., Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health
Machines and Vendors
GnuBio
Obligatory NGS Exponential Growth Slide
Nature Biotechnology, Volume 26, Number 10, October 2008
Ultra High Throughput + Lower Cost = Broader Applications
Experimental Design Considerations
Sequencing Platform in Use
Choice of Library Construction
Depth of coverage
Re$ources
Number of Replicates
Number of Samples and Control
Etc…
Flexibility with Standards and Scale
Then (CE) – The Norm: 10 Machines, 30 – 360 Days, 1 Project
Now (Illumina/SOLiD/454) – Scale: 1 machine, 14 Days, 30 Projects
Now (Ion Torrent) - Flexibility: 1 machine, 1 Day, 1 Project
Future (CLCBio, Nexus, Open Source): Standardization of analysis
Partial List of Mappers
* BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
* BWA - Heng Li's BWT Alignment program, a progression from Maq. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C++ source.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GenomeMapper - A short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. A tool created by the 1001 Genomes project. Source for POSIX.
* GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
* gnumap - The Genomic Next-generation Universal MAPper is a program designed to accurately map sequence data obtained from next-generation sequencing machines (specifically Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina, with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
* MOSAIK - Produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX.
* MrFAST and MrsFAST - Designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs, and MrsFAST has a bisulphite mode. Authors are from the University of Washington. C source.
* MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* PASS - Supports Illumina, SOLiD and Roche-FLX data formats and allows the user to tune the sensitivity of the alignments very finely. Spaced seed initial filter, then NW dynamic algorithm to a SW-like local alignment. Authors are from CRIBI in Italy. Win/Linux.
* RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OSs.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
* Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
* SOAP - Short Oligonucleotide Alignment Program. A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
* SSAHA - Sequence Search and Alignment by Hashing Algorithm, a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
* SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains SWIFT (fast local alignment search, guaranteeing to find epsilon-matches between two sequences) and SWIFT BALSAM (a very fast program to find semiglobal non-gapped alignments based on k-mer seeds). Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
* SXOligoSearch - A commercial platform offered by the Malaysia-based Synamatix. Will align Illumina reads against a range of RefSeq RNA or NCBI genome builds for a number of organisms. Web portal. OS independent.
* Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
* Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads produced by next-generation sequencing technology back to the reference genomes, and to carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly, with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
Courtesy of SeqAnswers.com
Evolving Sequencing & Analysis Methods to Enable Genomic Research
Real World Examples - Scale
1500+ Sample Epigenetic Study
Challenges
Sample Prep (MethylMiner)
Tracking (LIMS)
QC (Automation and Standardization)
Delivery (Automation and Standardization)
Solution
Mix of Commercial and Open Tools
CLC Bio and Genologics
Custom Algorithms
HPC and Storage
Onsite 100 TB NAS
S3 for Backup and Delivery
Real World Examples – Standards
Rapidly sequenced the genome of the Escherichia coli strain from the European outbreak
“…[University of Münster & Life Tech] received the samples on Monday, began sequencing that evening, and began analyzing the data on Wednesday…”
“…Justin Johnson, director of bioinformatics at EdgeBio, assembled and analyzed the raw reads made publicly available by BGI using CLC Bio's software…Johnson said his analysis took just a couple of hours…”
Mammalian Transcriptome Complexity
[Figure: schematic of mammalian transcriptional complexity along genomic DNA. Legend: TSS = transcription start site; ATG = translation start site; pA = polyadenylation signal; AAA = polyadenylation; protein coding regions; non-coding regions; spliced intron. Labelled small RNAs: tiRNA, PASR, TASR, miRNA (microRNAs). Courtesy of Life Technologies]
RNA-Seq
New Approach to RNA Profiling enabled by Next-Gen Sequencing
Yet based on well-established methodologies
Substantial Benefits over Hybridization-Based Methods
Better quantitative gene expression performance (DGE)
In addition, can allow a comprehensive view of transcription (Whole Transcriptome)
Transcriptome projects overview
Identification of imprinted genes contributing to specific brain regions by whole transcriptome sequencing
24 sample cohort for basic human expression and variant analysis in diseased patients.
32 sample cohort looking at novel splice junctions, gene fusions, and differential expression of colon cancer samples over a time series
Collaboration with Scripps Translational on Colon Cancer Transcriptomes
Sample Sourcing for Transcriptome Projects
Blood: Large quantities of sample available, but with limited utility in transcriptome analysis
Tissue: Needle biopsy most common, but sample quantity very low
Surgical section: Larger quantities available, but limited utility; need laser capture microdissection to provide useful results, sample quantity very low
FFPE Slides: Very useful in clinical research, but amount of sample and quality low
Unamplified vs Amplified
Prostate Cancer Cell Line (VCaP) from CPDR
Well characterized
Differential expression upon the addition of androgens
Compared transcriptome from a single pool of RNA
Unamplified, ribosomally depleted (Ribominus™)
Amplified, no ribosomal depletion required
Two pipelines for analysis
Amplification Gives Different Results
[Figure: Venn diagram of gene expression in unstimulated cells, unamplified vs. amplified; values shown: 14,075, 2112, 1071]
Spearman’s Correlation from 2 Pipelines
RNA-Seq Analysis Between Pipelines is Either Concordant
[Figure panels: Amplified, Stimulated, Pipe A; Amplified, Stimulated, Pipe B]
Or not…
[Figure panels: Unamplified, Stimulated, Pipe A; Unamplified, Stimulated, Pipe B]
Even if you remove all SNORA and SNORD
[Figure panels: Unamplified, Stimulated, Pipe A; Unamplified, Stimulated, Pipe B]
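The slides compare two analysis pipelines by Spearman's correlation, with and without the SNORA/SNORD genes. The sketch below illustrates that kind of comparison in plain Python. The gene names and counts are invented for illustration, and the rank-based implementation is an assumption, not the method either pipeline in the talk actually used.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(pipe_a, pipe_b, exclude_prefixes=()):
    """Spearman's rho over genes reported by both pipelines.

    pipe_a and pipe_b map gene name -> expression value.
    Genes whose name starts with any prefix in exclude_prefixes are dropped.
    """
    prefixes = tuple(exclude_prefixes)
    genes = sorted(g for g in pipe_a if g in pipe_b and not g.startswith(prefixes))
    ranks_a = average_ranks([pipe_a[g] for g in genes])
    ranks_b = average_ranks([pipe_b[g] for g in genes])
    return pearson(ranks_a, ranks_b)


# Hypothetical expression values for the same sample run through two pipelines.
pipe_a = {"GENE1": 10, "GENE2": 200, "GENE3": 35, "GENE4": 80,
          "SNORD1": 5000, "SNORA2": 4000}
pipe_b = {"GENE1": 12, "GENE2": 180, "GENE3": 30, "GENE4": 95,
          "SNORD1": 3, "SNORA2": 1}

rho_all = spearman(pipe_a, pipe_b)
rho_filtered = spearman(pipe_a, pipe_b, exclude_prefixes=("SNORA", "SNORD"))
print(f"all genes: {rho_all:.3f}, without SNORA/SNORD: {rho_filtered:.3f}")
```

In this toy data the small nucleolar RNAs drive all of the disagreement, so filtering them restores concordance. The slides report that the real unamplified data stayed discordant even after that filter.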
PolyA Selection vs Ribosomal Depletion
[Figure legend: NM RefSeq; NR RefSeq; histones (circles); SNORD/SNORA; rRNA (dots). Courtesy of Life Technologies]
Not what you want to hear…
Lots of manual work to run multiple pipelines
Join discordance
Scripting
Visualization
Filtering techniques based on YOUR data.
Exome and Targeted Resequencing
Capturing and interrogating a portion of the genome in many samples post GWAS
Fine map a region

  • 62. Catalogue variants for downstream filtering and identification of causative mutation(s)
  • 63. Exome and Targeted Resequencing projects overview
  • 64. Identification of the genetic basis of colorectal cancer through exome sequencing
  • 65. 600+ sample cohort to identify the genetic basis of a novel syndrome
  • 66. Exome sequencing of Tumor/Normal Leukemia patients to identify novel mutations present in tumor samples
  • 67. Exome sequencing of a large cohort (80+) to identify novel mutations linked to phenotypic changes
  • 68. Targeted Capture Technologies (figure). Technologies shown: NimbleGen SeqCap EZ; Agilent SureSelect; Febit HybSelect; LR-PCR; RainDance Technologies; Fluidigm. Axis: Genomic Region Captured (20 Kb, 1 MB, 2 MB, 3 MB, 4 MB, 5 MB, 30-50 MB, Exome).
  • 70. Ultimately Comes to Variation. Coverage; project design; cohorts; cancer. Algorithms a solved problem? Single open source pipelines; single commercial pipelines; proprietary internal algorithms; a mixture?
  • 71. Ultimately Comes to Variation. Coverage; project design; cohorts; cancer. Algorithms solved problem? Single open source pipelines; single commercial pipelines; proprietary internal algorithms; a mixture?
  • 73. EdgeBio Exon Coverage Statistics. How well is the exome covered? ** Data from fragment runs. Since moving to PE, seeing 15% improvement.
  • 74. Venter Genome - Algorithms. PLoS Genetics 2008, vol. 4, issue 8, e10000160: ~21K SNPs in exons (29 MB targeted); 36,206 expected SNPs for the 50 MB kit.
  • 75. 3 Tools and Associated SNP Counts. Software A: 45,511; Software B: 29,814; Software C: 40,964.
  • 76. Software B v. Software A (Venn diagram, not to scale). A: 45,511; B: 29,814. Intersection: 21,250; A only: 24,261; B only: 8,564. Union: 54,075.
  • 77. Software B v. Software C (Venn diagram). C: 40,964; B: 29,814. Intersection: 23,456; C only: 17,508; B only: 6,358. Union: 47,322.
  • 78. Software A v. Software C (Venn diagram). C: 40,964; A: 45,511. Intersection: 30,773; C only: 10,191; A only: 14,738. Union: 55,702.
  • 81. Again not what you want to hear…Lots of manual/semi-automated work to run multiple pipelines
  • 85. Better algorithms for variant calling
  • 87. Standardization of algorithms for variant calling
  • 88. It all begins with mapping. Exome Analysis – Cancer Specific (Dana-Farber Cancer Institute): multi-pipeline variant calling and LOH. Loss of heterozygosity detection in tumor vs. germline exome: candidate LOH genes selected with the following algorithm: non-synonymous heterozygous SNP in germline gene
  • 89. Non-synonymous homozygous SNP in tumor or additional Non-synonymous heterozygous SNP on the other allele
  • 90. Ion Torrent PGM: Longer, Accurate Reads in 2.5 Hours. Microbial & viral resequencing; microbial & viral de novo applications; eukaryotic amplicon sequencing; metagenomics; WGS; 16S surveys.
  • 94. Real World Examples – Speed. Rapidly sequenced the genome of the Escherichia coli strain from the European outbreak. “…[University of Münster & Life Tech] received the samples on Monday, began sequencing that evening, and began analyzing the data on Wednesday…” “…Justin Johnson, director of bioinformatics at EdgeBio, assembled and analyzed the raw reads made publicly available by BGI using CLC Bio's software…Johnson said his analysis took just a couple of hours…”
  • 95. Acknowledgements. CPDR (Center for Prostate Disease Research) collaboration: Shyh-Han Tan, Ph.D. Dana-Farber Cancer Institute collaboration: Andrew Lane, M.D., Ph.D.; David Weinstock, M.D.; Oliver Weigert, M.D., Ph.D. Scripps Translational Health: Samuel Levy. Sequencing team led by Joy Adigun. EdgeBio Research IFX led by John Seed, Ph.D. and Quang Nguyen, M.D., Ph.D.
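
The LOH candidate selection described on slides 88 and 89 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline used in the Dana-Farber collaboration: the `Call` record, its field names, and the `loh_candidate_genes` helper are hypothetical, the gene names are placeholders, and "on the other allele" is approximated as "at a position not already heterozygous in the germline" because phasing is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical SNP call record; not a format taken from the slides."""
    gene: str
    position: int
    zygosity: str        # "het" or "hom"
    nonsynonymous: bool


def loh_candidate_genes(germline, tumor):
    """Return sorted gene names that satisfy both slide criteria."""
    # Criterion 1: non-synonymous heterozygous SNP in the germline gene.
    germline_het = {}
    for call in germline:
        if call.nonsynonymous and call.zygosity == "het":
            germline_het.setdefault(call.gene, set()).add(call.position)

    candidates = set()
    for call in tumor:
        if not call.nonsynonymous or call.gene not in germline_het:
            continue
        if call.zygosity == "hom":
            # Criterion 2a: non-synonymous homozygous SNP in the tumor.
            candidates.add(call.gene)
        elif call.position not in germline_het[call.gene]:
            # Criterion 2b (assumption): an additional non-synonymous
            # heterozygous SNP, taken here as one at a new position.
            candidates.add(call.gene)
    return sorted(candidates)


# Toy data with placeholder gene names.
germline = [
    Call("GENE_A", 100, "het", True),
    Call("GENE_B", 200, "het", True),
    Call("GENE_C", 300, "het", False),   # synonymous: fails criterion 1
]
tumor = [
    Call("GENE_A", 100, "hom", True),    # het in germline, hom in tumor
    Call("GENE_B", 200, "het", True),    # unchanged germline het
    Call("GENE_B", 250, "het", True),    # additional het SNP in tumor
    Call("GENE_C", 300, "hom", True),
    Call("GENE_D", 400, "hom", True),    # no germline het in this gene
]
print(loh_candidate_genes(germline, tumor))
```

In a multi-pipeline setting the same filter could be run once per caller and the resulting gene lists compared, which is where the disagreement between tools shown on the earlier slides would surface.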

Editor's Notes

  • #4: Evolving Sequencing Methods to Enable Genomic Research
  • #7: Every house is built with a sturdy foundation.
  • #8: Evolving Sequencing Methods to Enable Genomic Research
  • #9: Because of this…
  • #10: We have this…
  • #11: Which allows this…As a CRO – we especially see how this is happening with those that may not have had access to these applications before due to access or finances.
  • #12: But with constantly expanding applications come…
  • #13: How does one stay technically relevant in a dramatically changing landscape?
  • #14: With sequencing becoming ubiquitous – not as simple as just sequence then science…Many questions to answer and expertise to be gained to make each project successful. We spend upwards of 25% of our time in this phase.
  • #16: We now have the issues of scale, compressed timelines, and standardization of sample prep and informatics.
  • #17: To illustrate the informatics challenge of standardization…Each can be run in hundreds of combinations to produce answers. All different.
  • #18: But when challenges are addressed, there can be immense power in discovery and eventually diagnostics. I will quickly mention 2 current projects that highlight and address some of the challenges, then jump into Transcriptome, Exome, and Ion Torrent sequencing.
  • #20: I can share a bit more of the findings later on.
  • #23: Key Points: New approach enabled by NGS, but it’s based on mature methods. In highlighting the two benefits, say “which is called” DGE/RNA-Seq respectively to initially define the two terms. The next two slides clarify these definitions. Old slide below: A (somewhat) new approach to RNA profiling using sequencing rather than hybridization. Variations on the theme have been used since the mid-90s: EST sequencing, SAGE, LongSAGE, MPSS. However, limitations and cost of sequencing technology, as well as lack of a finished, well-annotated genome reference, had prevented broad use vs. microarrays. Digital Gene Expression using next-gen sequencing: a transformative technology. Improved sensitivity, dynamic range, and linearity over microarrays. Removes background and biases seen with microarrays. Can provide a comprehensive view of splicing and transcription, if desired, not required. Now dozens of published papers validating the approach. Aggressively competing vendors making it better/faster/cheaper.
  • #25: We’re no longer in the early stages of technology adoption, and the biology is becoming more important. More accurate biology requires more refined samples, and that leads to issues in NGS, which generally has voracious material requirements. Amplification is generally the solution, but this leads to additional problems in analysis.
  • #27: Total number of genes expressed as a function of the union of the two methods was 17,258. The picture was the same for stimulated cells, except that androgen stimulation very slightly reduced the complexity of the transcriptome, with a total of 17,128 genes expressed in the union of the two methods. Different results don’t necessarily mean one is more accurate…just different. This analysis was performed with a single RNA-Seq pipeline. We subsequently discovered that different pipelines also give different results.
  • #28: You can conclude that samples separated by the very significant changes in biology associated with androgen stimulation are more closely correlated with each other than samples prepared by different methods. The data also suggest that the choice of analytic tools may have a greater impact than the biology as well.
  • #29: It used to be that sequencing chewed up the costs of projects; now it is the inverse.
  • #32: The next slide shows the difference between ribosomal depletion and poly(A)+ selection in the distribution of genes. Integrating the informatics pipelines with sample preparation methods and researcher’s needs is critical. There are amplification methods that don’t have the particular bias of the method used in these studies.
  • #39: Again, choice and goal of project are paramount when choosing and designing a capture.
  • #43: Start to lose your return on investment.
  • #54: Wrap up with Ion Torrent…
  • #59: Nature Precedings
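
The overlap figures quoted in the deck (the 17,258-gene union in note #27 and the pairwise SNP Venn diagrams on slides 76 to 78) can be rechecked with a few lines of arithmetic. The sketch below is a convenience for the reader, not part of the original analysis: the `venn_summary` helper is hypothetical, and the Jaccard index (intersection divided by union) is a concordance measure chosen here for illustration and does not appear in the slides. For the transcriptome row, the slides do not make clear which of the two exclusive counts belongs to which preparation, so they are left unlabeled.

```python
def venn_summary(shared, only_first, only_second):
    """Totals, union and Jaccard index for a two-set Venn diagram."""
    union = shared + only_first + only_second
    return {
        "first_total": shared + only_first,
        "second_total": shared + only_second,
        "union": union,
        "jaccard": shared / union,
    }


# Region counts as transcribed from the slides:
# (shared, only in first set, only in second set).
comparisons = {
    "Software A vs B": (21_250, 24_261, 8_564),
    "Software C vs B": (23_456, 17_508, 6_358),
    "Software A vs C": (30_773, 14_738, 10_191),
    # Exclusive counts not assigned to a specific preparation.
    "Transcriptome, unamplified vs amplified": (14_075, 2_112, 1_071),
}

for name, regions in comparisons.items():
    summary = venn_summary(*regions)
    print(
        f"{name}: totals={summary['first_total']:,}/"
        f"{summary['second_total']:,} union={summary['union']:,} "
        f"jaccard={summary['jaccard']:.3f}"
    )
```

The per-set totals recovered this way agree with the Venn slides (Software A at 45,511, B at 29,814, C at 40,964), and the transcriptome regions sum to the 17,258 genes stated in note #27.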