Enhanced structural variant and breakpoint
   detection using SVMerge by integration of multiple
   detection methods and local assembly

   Kim Wong/Thomas Keane
   Vertebrate Resequencing Informatics

   http://svmerge.sourceforge.net




Vertebrate Resequencing Informatics   22nd March, 2011
Genomic Structural Variation

 Large DNA rearrangements (>100bp)
 Frequent causes of disease
   Referred to as genomic disorders
   Mendelian diseases or complex traits such as behaviors
        E.g. increase in gene dosage due to increase in copy number
   Prevalent in cancer genomes
 Many types of genomic structural variation (SV)
   Insertions, deletions, copy number changes, inversions,
     translocations & complex events
 Comparative genomic hybridization (CGH) traditionally used for copy
 number discovery
   CNVs of 1–50 kb in size have been under-ascertained
 Next-gen sequencing revolutionised field of SV discovery
   Parallel sequencing of ends of large numbers of DNA fragments
   Examine alignment distance of reads to discover presence of
     genomic rearrangements
   Resolution down to ~100bp
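
The read-pair idea above can be sketched in a few lines. This is only an illustration and not SVMerge code: the function name, the tuple layout of the input, and the cutoff of 3 median absolute deviations are all assumptions made for the example.

```python
from statistics import median

def flag_discordant_pairs(pairs, n_mads=3.0):
    """Return pairs whose mapped distance is far from the library's typical span.

    pairs: list of (read_name, left_pos, right_pos) for pairs mapped to the
    same chromosome (hypothetical layout for this sketch).
    """
    spans = [right - left for _, left, right in pairs]
    centre = median(spans)
    # Median absolute deviation is robust to the outliers we want to find.
    mad = median(abs(s - centre) for s in spans) or 1
    return [p for p, s in zip(pairs, spans) if abs(s - centre) > n_mads * mad]

pairs = [
    ("r1", 1000, 1300),
    ("r2", 1100, 1410),
    ("r3", 1200, 1495),
    ("r4", 1250, 1560),
    ("r5", 1300, 4800),  # mapped about 3.5 kb apart: candidate deletion
]
discordant = flag_discordant_pairs(pairs)
print([name for name, _, _ in discordant])
```

Real callers also cluster the flagged pairs and require several supporting pairs before reporting an event; that step is left out here.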


Simple types of Structural Variation




Deletion

 SV Visualisation
   LookSeq viewer
   Read pairs displayed
   Y axis is aligned insert size
 Deletions are easily spotted
   Read pairs are mapped
     further apart than expected
   Coverage is zero across
     the deletion sequence
 Deletion in NOD/ShiLtJ
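
The coverage half of the deletion signature can be illustrated with a small scan for zero-depth runs. This is a sketch under stated assumptions: the per-base depth list, the function name, and the 100 bp minimum length (chosen to match the >100bp SV definition earlier in the talk) are not taken from SVMerge.

```python
def zero_coverage_runs(depth, min_len=100):
    """Return (start, end) half-open intervals where depth is zero for >= min_len bases."""
    runs = []
    start = None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i                      # a zero-depth run begins
        elif d != 0 and start is not None:
            if i - start >= min_len:       # keep only runs long enough to be an SV
                runs.append((start, i))
            start = None
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))   # run extends to the end of the region
    return runs

# Toy depth profile: a 250 bp gap in coverage and a short 20 bp dropout.
depth = [30] * 500 + [0] * 250 + [28] * 500 + [0] * 20 + [31] * 100
runs = zero_coverage_runs(depth)
print(runs)
```

A homozygous deletion would be expected to show both this signature and read pairs spanning the interval at a larger than expected distance; a heterozygous deletion would show reduced rather than zero depth.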




Inversion



 Mate pairs align in the same orientation
 Coverage zero at breakpoints


Insertion




 One-end mapped reads
 Coverage zero at breakpoint
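
The read-pair signatures described for deletions, inversions, and insertions can be summarised as a small decision function. This is a hedged sketch: the argument names, the expected insert size of 300 bp, and the tolerance are illustrative assumptions, and real callers require clusters of such pairs rather than judging pairs one at a time.

```python
def pair_signature(strand1, strand2, span, mate_mapped, expected=300, tolerance=100):
    """Label one read pair with the SV signature it is consistent with.

    strand1, strand2: '+' or '-' for each read; span: mapped distance in bp
    (None if the mate is unmapped). All names are hypothetical.
    """
    if not mate_mapped:
        return "one-end mapped (possible insertion)"
    if strand1 == strand2:
        return "same orientation (possible inversion)"
    if span > expected + tolerance:
        return "too far apart (possible deletion)"
    return "concordant"

print(pair_signature("+", "-", 310, True))
print(pair_signature("+", "-", 3500, True))
print(pair_signature("+", "+", 320, True))
print(pair_signature("+", None, None, False))
```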



Complex SV Events




 [Figure: complex event with regions labelled Inversion, Insertion, Insertion]
Human Examples




                                       Stankiewicz and Lupski (2010) Ann. Rev. Med.

Example 2: Transposable element insertion in mice




SVMerge

 Initially developed for mouse genomes project
     Several software packages currently available to discover SVs
 Various approaches using information from anomalously mapped read pairs OR
 read depth analysis
 No single SV caller is able to detect the full range of structural variants
     Paired-end mapping information, for example, cannot detect SVs where the
       read pairs do not flank the SV breakpoints
     Insertion calls made using the split-mapping approach are also size-limited
       because the whole insertion breakpoint must be contained within a read
     Read-depth approaches can identify copy number changes without the need
       for read-pair support, but cannot find copy number neutral events
 SVMerge is a meta SV calling pipeline that makes SV predictions with a
 collection of SV callers
     Input is one BAM file per sample
     Callers are run individually and their outputs sanitized into standard BED format
     SV calls are merged and computationally validated using local de novo assembly
     Primarily an SV discovery/calling + validation tool
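
The merging step can be pictured as interval merging over BED-style records. The sketch below is not SVMerge's implementation: the five-column layout (with SV type and caller name in columns four and five), the function name, and the simple "any overlap of the same type" rule are assumptions for illustration, and the coordinates are made up.

```python
from collections import defaultdict

def merge_calls(bed_lines):
    """Merge overlapping calls of the same SV type; record the supporting callers.

    bed_lines: tab-separated 'chrom start end sv_type caller' strings
    (an assumed column layout for this example).
    """
    by_key = defaultdict(list)
    for line in bed_lines:
        chrom, start, end, sv_type, caller = line.rstrip("\n").split("\t")
        by_key[(chrom, sv_type)].append((int(start), int(end), caller))

    merged = []
    for (chrom, sv_type), calls in sorted(by_key.items()):
        calls.sort()
        cur_start, cur_end, callers = calls[0][0], calls[0][1], {calls[0][2]}
        for start, end, caller in calls[1:]:
            if start < cur_end:                # overlaps the current cluster
                cur_end = max(cur_end, end)
                callers.add(caller)
            else:                              # gap: close the cluster, start a new one
                merged.append((chrom, cur_start, cur_end, sv_type, sorted(callers)))
                cur_start, cur_end, callers = start, end, {caller}
        merged.append((chrom, cur_start, cur_end, sv_type, sorted(callers)))
    return merged

raw = [
    "chr1\t1000\t2000\tDEL\tBreakDancerMax",
    "chr1\t1050\t1980\tDEL\tPindel",
    "chr1\t5000\t5600\tDEL\tPindel",
    "chr1\t1500\t2500\tINV\tBreakDancerMax",
]
merged = merge_calls(raw)
for call in merged:
    print(call)
```

Keeping the list of supporting callers per merged call makes it easy to ask later how many calls were found by only one method.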



SVMerge Workflow




                                                         Wong et al (2010)

SV Callers




                                                         Wong et al (2010)




Local Assembly Validation

 Key to the approach is the computational
 validation step
   Local assembly and breakpoint refinement
   All SV calls (except those lacking read
      pair support e.g. CNG/CNL)
 Algorithm
   Gather mapped reads, and any
      unmapped mate-pairs (within 1 kb of an insertion
      breakpoint, within 2 kb for all other SV types)
   Run local velvet assembly
   Realign the contigs produced with
      exonerate
   Detect contig breaks proximal to the
      breakpoint(s)
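
The first and last steps of the algorithm above can be sketched without invoking velvet or exonerate. The code below only computes the read-gathering window and looks for alignment breaks near a called breakpoint; the function names, the "INS" type label, the block representation of a contig alignment, and the 50 bp search distance are assumptions for this example, not SVMerge internals.

```python
def gather_window(sv_type, start, end):
    """Reference interval from which reads would be collected for local assembly."""
    flank = 1000 if sv_type == "INS" else 2000   # 1 kb for insertions, 2 kb otherwise
    return (max(0, start - flank), end + flank)

def breaks_near_breakpoint(contig_blocks, breakpoint, max_dist=50):
    """Return contig break positions within max_dist bp of a called breakpoint.

    contig_blocks: reference (start, end) intervals for the aligned pieces of
    one assembled contig. A break is the junction between consecutive pieces.
    """
    blocks = sorted(contig_blocks)
    junctions = []
    for (_, end1), (start2, _) in zip(blocks, blocks[1:]):
        junctions.extend([end1, start2])
    return [j for j in junctions if abs(j - breakpoint) <= max_dist]

# Toy deletion called at 10,000-12,000 with imprecise breakpoints.
window = gather_window("DEL", 10_000, 12_000)
contig = [(9_200, 10_037), (12_015, 12_800)]   # contig aligns in two pieces
left = breaks_near_breakpoint(contig, 10_000)
right = breaks_near_breakpoint(contig, 12_000)
print(window, left, right)
```

In this toy case the contig breaks at 10,037 and 12,015 would replace the original approximate coordinates, which is the sense in which local assembly refines breakpoints.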

Breakpoint Improvement (simulated)




Breakpoint Improvement (Real data)




                                                         Yalchin and Wong et al, in prep



Application to HapMap trio dataset

 High-depth HapMap trio (NA18506, NA18507, NA18508)
   42x, 42x and 40x
 Reads processed through Vert. Reseq. Pipeline
   Aligned to the GRCh37 human reference using BWA
   Single BAM file for each individual
 BreakDancerMax, Pindel, RDXplorer, SECluster, and RetroSeq
 Exclude calls
   Within 600 bp of a reference sequence gap
   Within 1 Mb of a centromere or telomere
 Computational validation of raw candidate calls
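
The exclusion rules above amount to a distance filter against annotated regions. The sketch below works on a single chromosome with made-up coordinates; the function names and the (start, end) tuple representation are assumptions, and only the 600 bp and 1 Mb margins come from the slide.

```python
GAP_MARGIN = 600               # bp from a reference sequence gap
CEN_TEL_MARGIN = 1_000_000     # bp from a centromere or telomere

def near(call, region, margin):
    """True if call (start, end) lies within `margin` bp of region (start, end)."""
    c_start, c_end = call
    r_start, r_end = region
    return c_start < r_end + margin and c_end > r_start - margin

def keep_call(call, gaps, centromeres_telomeres):
    """Apply both exclusion rules to one call."""
    if any(near(call, g, GAP_MARGIN) for g in gaps):
        return False
    if any(near(call, r, CEN_TEL_MARGIN) for r in centromeres_telomeres):
        return False
    return True

# Toy annotation for one chromosome (coordinates are invented).
gaps = [(1_200_000, 1_210_000)]
cen_tel = [(0, 10_000), (3_000_000, 6_000_000)]

calls = [
    (1_210_500, 1_211_000),   # 500 bp after a gap: excluded
    (1_500_000, 1_501_000),   # clear of everything: kept
    (2_500_000, 2_500_400),   # 500 kb before the centromere: excluded
]
kept = [c for c in calls if keep_call(c, gaps, cen_tel)]
print(kept)
```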




NA18506 Results




Do multiple callers discover more SVs?




How do the calls measure up?

 Compared the overlap of the deletion, gain, and inversion calls
 against the curated Database of Genomic Variants
   Overlapped with calls in DGV at a rate significantly higher than
      expected by random chance
   Deletions in DGV: 71% (NA18506), 81% (NA18507), and 71%
      (NA18508)
   Copy number gains in DGV: 29% (NA18506), 32% (NA18507),
      and 36% (NA18508)
   Inversions in DGV: 47% (NA18506), 69% (NA18507), and 51%
      (NA18508)
 Child calls not in DGV also called in the parents
   Further 18% deletions, 32% inversions, 54% duplications
   Estimated max. false positive rate of 11%, 21%, and 17%
      (deletions, inversions, and duplications, respectively)
 All child-only SV calls comprise 11% of the child's final SV call set
   Considerable improvement from 'merged raw' (50% unique)
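
The maximum false positive estimates can be reproduced from the percentages on this slide, on the assumption that NA18506 is the child and that "duplications" corresponds to copy number gains. Under that reading, a call counts as a potential false positive only if it is neither in DGV nor seen in a parent.

```python
# Percentages of the child's calls, as quoted on the slide (NA18506 assumed to be the child).
in_dgv = {"deletions": 71, "inversions": 47, "duplications": 29}
not_in_dgv_but_in_parent = {"deletions": 18, "inversions": 32, "duplications": 54}

# Whatever is left has no support from either source: an upper bound on false positives.
max_false_positive = {
    sv: 100 - in_dgv[sv] - not_in_dgv_but_in_parent[sv] for sv in in_dgv
}
print(max_false_positive)
```

The result matches the 11%, 21%, and 17% quoted above, which is consistent with those figures being listed in the order deletions, inversions, duplications.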


Complex SV Types




                                                         Yalchin and Wong et al, in prep

Future Work

 SVMerge primarily a discovery and validation tool
   Extensible pipeline so that calls from any method can be easily
    incorporated
 Developed primarily for mouse genomes project
   Successfully applied to human trio dataset
   Computational validation approach reduces false positives
 Complex SVs
   Cataloging repeating combinations of multiple SV events in small
    loci
 2011 development
   Low coverage cross-population SV discovery
   Genotyping existing SVs in new samples
   Better support for heterozygous calls
   Integration of SVMerge into Vert. Reseq. pipeline for UK10K
