THE GOBY FRAMEWORK: TOWARDS EFFICIENT NEXT-GENERATION SEQUENCING DATA ANALYSIS
Nyasha Chambwe, Kevin C. Dorff, Marko Srdanovic, Xutao Deng, Stuart J.D. Andrews, Fabien Campagne
The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine; Department of Physiology and Biophysics
Weill Medical College of Cornell University
http://goby.campagnelab.org
Applications of Next Generation Sequencing (McPherson J.D. Nat Methods. 2009)
Next Generation Sequencers (Metzker, M.L. Nat Rev Genet. 2010)
Platform                       | NGS Chemistry           | Avg Read Length (bp) | Run Time (days) | Giga bases/run | Million reads/run
Roche/454 GS FLX Titanium      | Pyrosequencing          | 330                  | 0.35            | 0.45           | 1.36
Illumina/Solexa GA IIe         | Reversible Terminators  | 75                   | 4               | 18             | 240
Life Technologies SOLiD 3      | Sequencing by ligation  | 50                   | 7               | 30             | 600
Helicos BioSciences Heliscope  | Reversible Terminators  | 32                   | 8               | 37             | 1156
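The figures on this slide are internally consistent: reads per run is bases per run divided by average read length. The short Python sketch below reproduces that arithmetic. The per-day throughput column is a derived quantity added here for illustration; it does not appear on the slide.

```python
# Figures as quoted on the slide (Metzker 2010).
# Tuple layout: (avg read length in bp, run time in days, gigabases per run)
PLATFORMS = {
    "Roche/454 GS FLX Titanium": (330, 0.35, 0.45),
    "Illumina/Solexa GA IIe": (75, 4, 18),
    "Life Technologies SOLiD 3": (50, 7, 30),
    "Helicos BioSciences Heliscope": (32, 8, 37),
}


def million_reads_per_run(gigabases_per_run, read_length_bp):
    """Gigabases -> megabases, then divide by read length to get millions of reads."""
    return gigabases_per_run * 1000 / read_length_bp


def gigabases_per_day(gigabases_per_run, run_days):
    """Derived throughput per day (not a column on the slide)."""
    return gigabases_per_run / run_days


for name, (read_len, days, gb) in PLATFORMS.items():
    reads = million_reads_per_run(gb, read_len)
    per_day = gigabases_per_day(gb, days)
    print(f"{name}: {reads:.2f} million reads/run, {per_day:.2f} Gb/day")
```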
Next Generation Sequence Data Formats
Key Limitations:
  • Text-based formats do not scale well to handle large amounts of data
  • Naïve compression prevents semi-random access
File Format Wish List
  • Structured schema/data representation
  • Well specified and documented (not ambiguous)
  • Fast parsing speed
  • Language and operating system portability
  • Backward and forward compatibility
  • Compression
  • Random access
  • Streaming
The Goby Software Framework (layer diagram)
  • File Formats: reads, alignments, histograms
  • Low Level APIs: Java, C++, Python; Readers, Writers, Iterators
  • Tools/Utilities: File Format Conversions, Alignment Processing, Visualization
  • Applications: RNA-Seq Pipeline, IGV Plug-in
Structured non-ambiguous representation
  • Goby uses Protocol Buffers (PB) to provide “a flexible, efficient, automated mechanism for serializing structured data” (PB website)
  • PB generates parsers in different languages, e.g., Java, C++, Python, Perl, R, C, C#, Visual Basic, PHP, Objective C, Ruby, Common Lisp
  • PB provides forward and backward compatibility
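One way to see where the forward compatibility comes from is the Protocol Buffers wire encoding: every field is written with its field number and wire type, so a reader built against an older schema can skip fields it does not know. The Python sketch below hand-rolls a small subset of that encoding (varint and length-delimited fields only) purely for illustration. Real code would use classes generated by the Protocol Buffers compiler, and the field numbers and names (`read_index`, `sequence`) are hypothetical, not taken from Goby's .proto files.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # two of the Protocol Buffers wire types


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7, value = value & 0x7F, value >> 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf, pos):
    """Return (value, next_position) for the varint starting at buf[pos]."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode an int as a varint field, or bytes as a length-delimited field."""
    if isinstance(value, int):
        return encode_varint((number << 3) | VARINT) + encode_varint(value)
    return (encode_varint((number << 3) | LENGTH_DELIMITED)
            + encode_varint(len(value)) + value)


def decode_known(buf, schema):
    """Decode only the field numbers listed in `schema`; skip all others."""
    pos, message = 0, {}
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in schema:
            message[schema[number]] = value
    return message


# A newer writer emits three fields; field 3 is unknown to the older reader.
new_message = (encode_field(1, 300) + encode_field(2, b"ACGT")
               + encode_field(3, b"IIII"))
old_schema = {1: "read_index", 2: "sequence"}  # hypothetical field names
print(decode_known(new_message, old_schema))
```

The older reader recovers the two fields it knows and silently steps over the third, which is the behavior that lets a format grow new fields without breaking existing tools.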
Goby compact formats
  • Data is represented by Protocol Buffers as a message defined by a .proto file
Goby compact formats
Chunking:
  • Semi-random access
  • Efficient parallel processing
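The idea behind chunking can be sketched in a few lines of Python: compress fixed-size groups of entries independently, so that a reader can decompress one group without touching the rest of the file. This is only an illustration of the general technique. The explicit offset index and the length-prefixed entry layout are assumptions made for the sketch and do not describe Goby's on-disk format.

```python
import gzip
import struct


def write_chunks(entries, entries_per_chunk):
    """Compress entries in independent chunks; return (blob, index).

    index[i] is the (offset, size) of chunk i inside blob.
    """
    blob, index = bytearray(), []
    for start in range(0, len(entries), entries_per_chunk):
        group = entries[start:start + entries_per_chunk]
        # Length-prefix each entry so the chunk can be split again after decompression.
        raw = b"".join(struct.pack("<I", len(e)) + e for e in group)
        compressed = gzip.compress(raw)
        index.append((len(blob), len(compressed)))
        blob += compressed
    return bytes(blob), index


def read_chunk(blob, index, chunk_number):
    """Decompress a single chunk without reading any other chunk."""
    offset, size = index[chunk_number]
    raw = gzip.decompress(blob[offset:offset + size])
    entries, pos = [], 0
    while pos < len(raw):
        (length,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        entries.append(raw[pos:pos + length])
        pos += length
    return entries


entries = [f"read{i}".encode() for i in range(25)]
blob, index = write_chunks(entries, entries_per_chunk=10)
# Entry 13 lives in chunk 13 // 10 == 1, at position 13 % 10 == 3 within it.
print(read_chunk(blob, index, 1)[3])
```

Access is "semi-random" in the sense that the unit of retrieval is a whole chunk: reaching one entry costs one chunk decompression rather than a scan of the entire file.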
Goby File Size Comparisons
  • MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050), sequenced on four next-gen platforms
Alignment Iterator
Code fragment to:
  • Scan through two alignments (input1, input2)
  • Print information for each entry
  • Print information for chromosomes 1, 2, X only
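The code fragment itself did not survive in this transcript. The Python sketch below shows the general shape of such a loop: chain several alignment sources and keep only entries on selected references. It is not the Goby API. The `AlignmentEntry` type, its fields, and `iterate_alignments` are hypothetical stand-ins, and the two inputs are small in-memory lists rather than compact alignment files.

```python
from itertools import chain
from typing import NamedTuple


class AlignmentEntry(NamedTuple):
    """Hypothetical, simplified alignment record (not Goby's schema)."""
    query_index: int
    reference: str
    position: int


def iterate_alignments(sources, keep_references=None):
    """Yield entries from every source in turn, optionally filtered by reference."""
    for entry in chain.from_iterable(sources):
        if keep_references is None or entry.reference in keep_references:
            yield entry


# Stand-ins for the two alignments named on the slide.
input1 = [AlignmentEntry(0, "1", 1500), AlignmentEntry(1, "7", 220)]
input2 = [AlignmentEntry(0, "X", 98), AlignmentEntry(1, "2", 4096)]

# Print information for entries on chromosomes 1, 2 and X only.
for entry in iterate_alignments([input1, input2], keep_references={"1", "2", "X"}):
    print(entry.query_index, entry.reference, entry.position)
```

Writing the filter as a generator keeps memory use flat, since entries are inspected one at a time instead of being loaded into a list first.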
RNA-Seq Pipeline
Objective: To determine levels of expression in samples and perform differential expression analysis
Supports:
  • Mapping to full genome
  • Mapping to annotated cDNAs (reads match inside exons and across exon-exon boundaries)
  • Sequencing platform independent
  • Published normalization methods implemented (Mortazavi A et al. Nat Methods. 2008; Bullard JH et al. BMC Bioinformatics. 2010)
  • Bias correction for platform-specific biases (Hansen KD et al. Nucleic Acids Res. 2010)
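As an example of one of the cited normalization methods, RPKM (reads per kilobase of transcript per million mapped reads, Mortazavi et al. 2008) scales a gene's read count by gene length and by sequencing depth: RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the gene, N the total number of mapped reads, and L the gene length in bp. The Python sketch below computes it directly from that formula. It is not Goby's implementation, and the gene names and counts are made up for illustration.

```python
def rpkm(reads_in_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return reads_in_gene * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical counts: gene -> (reads mapped to gene, gene length in bp)
counts = {"geneA": (500, 2000), "geneB": (500, 4000)}
total_mapped_reads = 10_000_000

for gene, (reads, length) in counts.items():
    print(gene, rpkm(reads, length, total_mapped_reads))
```

With equal read counts, the gene that is twice as long receives half the RPKM, which is the length correction the method is designed to provide.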
Sample RNA-Seq Results
Conclusion
  • Goby file formats are efficient and non-ambiguous
  • Alignments are about five times smaller than BAM alignments
  • API makes it easy to write efficient code to handle large datasets
  • Framework provides utilities and analysis pipelines for common NGS data analysis tasks
Acknowledgements
  • Campagne Lab: Fabien Campagne, Kevin C. Dorff, Marko Srdanovic, Stuart J.D. Andrews
  • Broad Institute: Jim Robinson
  • FDA/NCTR: Leming Shi
  • Sequencing Quality Control Project (SEQC)
  • Helicos, Illumina, Life Technologies, Roche
  • http://goby.campagnelab.org
 
cDNA Search

Editor's Notes

  • #3: Applications of NGS. Explosion of NGS. A gap exists between current sequence-generation capabilities and the data-analysis capabilities needed to extract relevant biological insights.
  • #4: Several sequencing platforms are available on the market, each with its own chemistry and producing data with different characteristics. Throughput varies and can be very large.
  • #5: Preponderance of NGS data file formats to represent these data
  • #6: Here is a list of characteristics we find desirable in an NGS file format. Transition: we developed file formats that meet these requirements. File formats alone are not sufficient, so we have developed a framework to use these formats and create analysis tools.
  • #7: This is an outline of the Goby Software Framework
  • #8: Now I will discuss File formats
  • #9: PB: think XML, but better.
  • #10: Brief overview of how schemas are written using PBs. A collection of messages of type readEntries.
  • #11: Transition: to achieve compression we gzip collections of messages
  • #12: Protocol Buffers do not support messages larger than a few megabytes. Goby's contribution is to use Protocol Buffers in a way that removes this collection-size limitation, so the formats scale to very large collections of messages. The limit is overcome by splitting collections into chunks: each chunk of a compact reads file holds 10,000 or fewer ReadEntry messages. Chunking supports semi-random access and is leveraged for parallel processing, because different servers can access chunks independently.
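Because each chunk can be read on its own, work over a large file can be divided chunk by chunk. The Python sketch below illustrates that with a thread pool standing in for the "different servers" mentioned in the note. The 10,000-entry limit is taken from the note, while the newline-separated chunk layout and the function names are assumptions made for the sketch, not Goby's format or API.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

MAX_ENTRIES_PER_CHUNK = 10_000  # chunk size quoted in the note above


def make_chunks(entries, max_entries=MAX_ENTRIES_PER_CHUNK):
    """Compress entries into independent chunks of at most max_entries each."""
    return [
        gzip.compress(b"\n".join(entries[i:i + max_entries]))
        for i in range(0, len(entries), max_entries)
    ]


def count_entries(chunk):
    """Work done by one worker: decompress a single chunk and count its entries."""
    return len(gzip.decompress(chunk).split(b"\n"))


def count_all(chunks, workers=4):
    """Process chunks concurrently and combine the per-chunk results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_entries, chunks))


entries = [b"ACGT"] * 25_000
chunks = make_chunks(entries)
print(len(chunks), count_all(chunks))
```

No worker needs data from another chunk, so the same pattern carries over to separate processes or machines that each receive a subset of chunks.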
  • #13: Gzip and chunking meet the requirements for random access and streaming. Transition: how well do we do with respect to file sizes?
  • #14: Apples-to-apples comparison. Multiple alignments.
  • #15: Formats are compact. How can YOU use them? Low-level APIs.
  • #17: One practical example of printing entries in an alignment file. Goby makes it easy to write code to iterate over the contents of multiple compact alignment files.
  • #18: Goby provides utilities to help build analysis pipelines.
  • #21: MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050), sequenced on multiple platforms. Normalized gene expression counts (RPKM). Random hexamer priming results in a bias in nucleotide composition at the start of sequence reads (Hansen KD et al. Nucleic Acids Res. 2010 Jul 1;38(12):e131. Epub 2010 Apr 14). The Hansen reweighting scheme to correct for that bias is implemented in Goby for genes.
  • #23: Ambion Human Brain Reference RNA (MAQC-II sample B): different brain regions from 23 donors.