Enabling Large Scale Sequencing Studies through Science as a Service

Enabling Large Scale Sequencing Studies through Science as a Service (ScaaS)
Justin H. Johnson
Director of Bioinformatics
EdgeBio
Washington DC, USA

Agenda
* Who We Are
* NGS at 30K
* Challenges and Enabling Through ScaaS
* Transcriptome Projects
* Exome Projects
* Ion Torrent Data

Contract Research Division
* Five SOLiD4 sequencing platforms
* One Life Technologies 5500XL
* Two Ion Torrent PGMs
* Automation through Caliper Sciclone & Biomek FX
* Life Technologies Preferred Service Provider
* Agilent Certified Service Provider
* Commercial partnerships with companies such as CLCBio, DNANexus and Genologics
* MD/PhD & Masters Level Scientists and Bioinformaticians
* IT Infrastructure of >100 CPUs and >100TB storage

Edge BioServ Scientific Advisory Board
* Elaine Mardis, Ph.D., Co-Director, Genome Sequencing Center, Washington University School of Medicine
* Sam Levy, Ph.D., Director of Genome Sciences, Scripps Translational Science Institute, Scripps Genomic Medicine
* Michael Zody, M.S., Chief Technologist, Broad Institute
* Ken Dewar, Ph.D., Assistant Professor, McGill University and Genome Quebec
* Steven Salzberg, Ph.D., Director, Center for Bioinformatics and Computational Biology, University of Maryland
* Gabor Marth, Ph.D., Professor of Bioinformatics, Boston College
* Elliott Margulies, Ph.D., Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health

Obligatory NGS Exponential Growth Slide
Nature Biotechnology, Volume 26, Number 10, October 2008

Ultra High Throughput + Lower Cost = Broader Applications

Experimental Design Considerations
Sequencing Platform in Use

Choice of Library Construction

Flexibility with Standards and Scale
* Then (CE) - The Norm: 10 Machines, 30 - 360 Days, 1 Project
* Now (Illumina/SOLiD/454) - Scale: 1 machine, 14 Days, 30 Projects
* Now (Ion Torrent) - Flexibility: 1 machine, 1 Day, 1 Project
* Future (CLCBio, Nexus, Open Source): Standardization of analysis

Partial List of Mappers
* BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
* BWA - Heng Li's BWT Alignment program, a progression from Maq. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C++ source.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GenomeMapper - GenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads either with ungapped or gapped alignments. A tool created by the 1001 Genomes project. Source for POSIX.
* GMAP - GMAP (Genomic Mapping and Alignment Program) for mRNA and EST Sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
* gnumap - The Genomic Next-generation Universal MAPper (gnumap) is a program designed to accurately map sequence data obtained from next-generation sequencing machines (specifically that of Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
* MOSAIK - MOSAIK produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX.
* MrFAST and MrsFAST - mrFAST & mrsFAST are designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs, and MrsFAST has a bisulphite mode. Authors are from the University of Washington. C as source.
* MUMmer - MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* PASS - It supports Illumina, SOLiD and Roche-FLX data formats and allows the user to modulate very finely the sensitivity of the alignments. Spaced seed initial filter, then NW dynamic algorithm to a SW(like) local alignment. Authors are from CRIBI in Italy. Win/Linux.
* RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
* Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
* SOAP - SOAP (Short Oligonucleotide Alignment Program). A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
* SSAHA - SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
* SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains SWIFT (fast local alignment search, guaranteeing to find epsilon-matches between two sequences) and SWIFT BALSAM (a very fast program to find semiglobal non-gapped alignments based on k-mer seeds). Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
* SXOligoSearch - SXOligoSearch is a commercial platform offered by the Malaysian-based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web Portal. OS independent.
* Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
* Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads, generated by next-generation sequencing technology, back to the reference genomes, and carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
Courtesy of SeqAnswers.com
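Several mappers in the list above (Bowtie, BWA, the updated SOAP) are described as using a BWT index. The following is a minimal Python sketch of that idea, not the implementation used by any of those tools: it builds a toy Burrows-Wheeler index with a naive suffix sort and finds exact matches by backward search. The function names `bwt_index` and `find_exact` and the example reference are hypothetical, and real aligners add compressed indexes, sampled suffix arrays, and mismatch handling that are omitted here.

```python
def bwt_index(text):
    """Build a toy BWT index: suffix array, first-column offsets, and Occ table."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (text[-1] is the sentinel).
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in text that sort strictly before c.
    first = {}
    total = 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def find_exact(read, index):
    """Backward search: return sorted 0-based positions where read occurs exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows still matching
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = bwt_index("GATTACAGATTACA")  # made-up reference sequence
    print(find_exact("ATTA", index))
    print(find_exact("GGG", index))
```

The search cost depends on the read length rather than the reference length, which is the property that makes this family of indexes attractive for mapping millions of short reads.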

Evolving Sequencing & Analysis Methods to Enable Genomic Research

Real World Examples - Scale
1500+ Sample Epigenetic Study
Challenges
* Sample Prep (MethylMiner)
* QC (Automation and Standardization)
* Delivery (Automation and Standardization)
Solution
* Mix of Commercial and Open Tools
* S3 for Backup and Delivery

Real World Examples - Standards
Rapidly sequenced the genome of the Escherichia coli strain from the European outbreak
* "…[University of Münster & Life Tech] received the samples on Monday, began sequencing that evening, and began analyzing the data on Wednesday…"
* "…Justin Johnson, director of bioinformatics at EdgeBio, assembled and analyzed the raw reads made publicly available by BGI using CLC Bio's software…Johnson said his analysis took just a couple of hours…"

Mammalian Transcriptome Complexity
[Figure: mammalian transcriptional complexity along genomic DNA. Labels: TSS (transcription start site), ATG (translation start site), pA (polyadenylation signal), AAA (polyadenylation), protein coding regions, non-coding regions, spliced intron, tiRNA, PASR, TASR, miRNA/microRNAs. Courtesy of Life Technologies]

RNA-Seq
New Approach to RNA Profiling enabled by Next-Gen Sequencing

Yet based on well-established methodologies

Substantial Benefits over Hybridization-Based Methods

Better quantitative gene expression performance (DGE)

In addition, can allow a comprehensive view of transcription (Whole Transcriptome)

Transcriptome projects overview

Identification of imprinted genes contributing to specific brain regions by whole transcriptome sequencing

24 sample cohort for basic human expression and variant analysis in diseased patients.

32 Sample cohort looking at novel splice junctions, gene fusions, and differential expression of colon cancer samples over a time series

Collaboration with Scripps Translational on Colon Cancer Transcriptomes

Sample Sourcing for Transcriptome Projects
* Blood: Large quantities of sample available, but with limited utility in transcriptome analysis
* Tissue: Needle biopsy most common, but sample quantity very low
* Surgical section: Larger quantities available, but limited utility; need laser capture microdissection to provide useful results, sample quantity very low
* FFPE Slides: Very useful in clinical research, but amount of sample and quality are low

Unamplified vs Amplified
* Prostate Cancer Cell Line (VCaP) from CPDR
* Well characterized
* Differential Expression upon the addition of androgens
* Compared transcriptome from a single pool of RNA
* Unamplified, ribosomally depleted (RiboMinus™)
* Amplified, no ribosomal depletion required
* Two Pipelines for analysis

Amplification Gives Different Results
[Figure: Gene Expression in Unstimulated Cells, Unamp vs Amplified; gene counts extracted as "14,07521121071"]

Spearman’s Correlation from 2 Pipelines
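To make the comparison concrete, here is a minimal Python sketch of how a Spearman rank correlation between two pipelines' per-gene expression values could be computed. It is an illustration under stated assumptions, not the analysis behind the slide: the function names and the dictionary-of-counts input are hypothetical, only genes reported by both pipelines are compared, and tied values receive average ranks.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def compare_pipelines(expr_a, expr_b):
    """Correlate expression values for genes reported by both pipelines."""
    shared = sorted(set(expr_a) & set(expr_b))
    return spearman([expr_a[g] for g in shared], [expr_b[g] for g in shared])


if __name__ == "__main__":
    # Hypothetical gene identifiers and values, for illustration only.
    pipe_a = {"G1": 5, "G2": 100, "G3": 40, "G4": 0}
    pipe_b = {"G1": 7, "G2": 90, "G3": 55, "G5": 3}
    print(compare_pipelines(pipe_a, pipe_b))
```

Because the statistic uses ranks only, two pipelines that normalize counts differently can still score highly as long as they order genes the same way.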

RNA-Seq Analysis Between Pipelines is Either Concordant
[Figure: Amplified, Stimulated, Pipe A vs Amplified, Stimulated, Pipe B]

Or not…
[Figure: Unamplified, Stimulated, Pipe A vs Unamplified, Stimulated, Pipe B]

Even if you remove all SNORA and SNORD
[Figure: Unamplified, Stimulated, Pipe A vs Unamplified, Stimulated, Pipe B]
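Removing SNORA and SNORD entries before comparing pipelines can be done with a simple name-based filter. The Python sketch below is one possible way to do it, assuming expression values are keyed by gene symbol and that snoRNA genes can be recognised by a SNORA or SNORD prefix; the function name and the example values are hypothetical.

```python
SNORNA_PREFIXES = ("SNORA", "SNORD")


def drop_snornas(expression):
    """Return a copy of the gene -> value mapping without SNORA/SNORD genes."""
    return {
        gene: value
        for gene, value in expression.items()
        if not gene.upper().startswith(SNORNA_PREFIXES)
    }


if __name__ == "__main__":
    # Made-up values, for illustration only.
    unamplified = {"SNORD3A": 900, "SNORA73B": 400, "AR": 12, "KLK3": 55}
    print(drop_snornas(unamplified))
```

A prefix filter like this is only as good as the annotation's naming, so it is worth checking which identifiers were actually dropped.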

PolyA Selection vs Ribosomal Depletion
[Figure legend: NM refseq; NR refseq; Histones (circles); SNORD/SNORA; rRNA dots. Courtesy of Life Technologies]

Not what you want to hear…
Lots of manual work to run multiple pipelines

Filtering techniques based on YOUR data.

Exome and Targeted Resequencing
Capturing and interrogating a portion of the genome in many samples post GWAS
