Algorithms and filters used to improve the Tribolium draft assembly with physical maps based on imaging ultra-long single DNA molecules
Jennifer Shelton 
2014
Assembly Pipeline 
3) Use the sequence reference to adjust molecule stretch for each scan.
Assembly Pipeline 
In recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan in a flow cell, followed by a plateau and then a lower plateau.
Assembly Pipeline 
5) Use sequence reference to determine assembly noise parameters. 
Estimated genome size is used to set the p-value threshold.
Assembly Pipeline 
6/7) Variants of the starting p-value and default minimum molecule length are 
explored in nine assemblies.
Current Tribolium sequence-based assembly 
Input file                   N50 (Mb)   Number of contigs   Cumulative length (Mb)
Genome FASTA                 1.16       2240                160.74
in silico CMAP from FASTA    1.20       223                 152.53
The 223 scaffolds from the sequence-based assembly that were longer than 20 kb and had more than 5 labels were converted into in silico CMAPs.
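The scaffold selection rule above can be sketched in a few lines. This is a minimal illustration only: the `Scaffold` record, the function name and the example scaffolds are hypothetical, and the real conversion to in silico CMAPs is not reproduced here.

```python
from dataclasses import dataclass

# Thresholds stated on the slide: longer than 20 kb, more than 5 labels.
MIN_LENGTH_BP = 20_000
MIN_LABELS = 5


@dataclass
class Scaffold:
    """Hypothetical record for one sequence scaffold."""
    name: str
    length_bp: int
    label_count: int


def keep_for_in_silico_cmap(scaffold: Scaffold) -> bool:
    """Return True if the scaffold is long enough and has enough labels."""
    return scaffold.length_bp > MIN_LENGTH_BP and scaffold.label_count > MIN_LABELS


# Made-up example scaffolds, not Tribolium data.
scaffolds = [
    Scaffold("scaffold_a", 1_200_000, 140),
    Scaffold("scaffold_b", 15_000, 9),   # too short
    Scaffold("scaffold_c", 45_000, 3),   # too few labels
]
kept = [s.name for s in scaffolds if keep_for_in_silico_cmap(s)]
print(kept)
```

Both comparisons are strict because the slide says "longer than" and "more than".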
Assembly Results 
Input file                                        N50 (Mb)   Number of contigs   Cumulative length (Mb)
Genome FASTA                                      1.16       2240                160.74
in silico CMAP from FASTA                         1.20       223                 152.53
CMAP from assembled BNG molecules (BNG CMAP)      1.35       216                 200.47
The assembled BNG molecules had a higher N50 and a longer cumulative length than the sequence assembly.
The estimated size of the Tribolium genome is ~200 Mb.
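For readers unfamiliar with the N50 column, here is a minimal sketch of the usual definition: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the cumulative length. The function name and the toy lengths are illustrative and are not taken from the scripts used for the talk.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or map) lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: the total is 32, and the two longest contigs (8 + 8) reach half of it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```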
Simplest XMAP alignment description 
[Figure: in silico CMAP 1 (1 Mb) and in silico CMAP 2 (1.1 Mb), from the genome FASTA, aligned to BNG CMAP 1 (1.1 Mb) and BNG CMAP 2 (1.3 Mb), from assembled molecules]
Breadth of alignment coverage for in silico CMAP: 2.1 Mb
Total alignment length for in silico CMAP: 2.1 Mb
Breadth of alignment coverage for BNG CMAP: 2.4 Mb
Total alignment length for BNG CMAP: 2.4 Mb
Complex XMAP alignment description 
[Figure: in silico CMAP 1 (1 Mb), from the genome FASTA, aligned to both BNG CMAP 1 (1.1 Mb) and BNG CMAP 2 (1.3 Mb), from assembled molecules]
Breadth of alignment coverage for in silico CMAP: 1 Mb
Total alignment length for in silico CMAP: 2 Mb
Breadth of alignment coverage for BNG CMAP: 2.4 Mb
Total alignment length for BNG CMAP: 2.4 Mb
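The difference between the two metrics can be sketched as follows, assuming each alignment is reduced to a (start, end) interval on a single map. Total alignment length sums every interval, while breadth counts each covered position once. The function names and the interval representation are assumptions for illustration; they are not the XMAP parsing used in the actual scripts.

```python
def total_alignment_length(intervals):
    """Sum of all aligned interval lengths; overlapping regions count twice."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length of the union of the intervals; overlapping regions count once."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # The interval overlaps the current one, so extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# A 1 Mb in silico map aligned along its full length to two BNG maps
# (coordinates in Mb), as in the complex example above.
alignments_on_in_silico_1 = [(0.0, 1.0), (0.0, 1.0)]
print(breadth_of_coverage(alignments_on_in_silico_1))     # breadth
print(total_alignment_length(alignments_on_in_silico_1))  # total
```

For an assembly with several maps, the same two functions would be applied per map and the results summed.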
Alignment of CMAPs 
[Figure: in silico CMAP 1 (1 Mb), from the genome FASTA, aligned to both BNG CMAP 1 (1.1 Mb) and BNG CMAP 2 (1.3 Mb), from assembled molecules]
Comparing breadth of alignment coverage with total aligned length can indicate relevant relationships between assemblies.
In this example, differences between "breadth" and "total" length could be due to:
Duplications in the sample the molecules were extracted from
Assembly of alternate haplotypes
Mis-assembly creating redundant contigs
Collapsed repeat in sequence assembly
Alignment of BNG assembly to reference genome 
CMAP name                                       Breadth of alignment coverage (Mb)   Total alignment length (Mb)   Percent of CMAP aligned
in silico CMAP from FASTA                       124.04                               132.40                        81
CMAP from assembled BNG molecules (BNG CMAP)    131.64                               132.34                        67
Close to 4% of the alignment of the in silico CMAP appears to be redundant.
Overall, 81% of the in silico CMAP aligns to the BNG consensus map.
Alignment of BNG assembly to reference genome
[Figure: ChLG 9 super-scaffold and ChLG 9 scaffolds (130, 131, 133, 134, 132, 129, 135, 127, 136, 137) aligned to BNG consensus maps]
Where redundant alignments occur, typically two BNG consensus maps aligned, suggesting that they represent haplotypes, although this has not been verified.
Tribolium super-scaffolds overlapping BNG cmap 
[Figure: ChLG 9 super-scaffold and ChLG 9 scaffolds (128, 130, 131, 133, 134, 132) aligned to BNG consensus maps]
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl estimates super-scaffolds using alignments between the scaffolds and the assembled BNG molecules, made with BNG RefAligner.
The in silico CMAP is aligned as the reference.
The alignment is inverted and used as input for stitch.
Alignments are filtered on confidence and on alignment length relative to the total possible alignment length.
[Figure: in silico CMAPs +1, +2, -3 and +4 aligned to BNG CMAP 1 and BNG CMAP 2]
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch.pl checks alignment length against potential alignment length to find relevant global rather than local alignments.
An alignment passes if its length is greater than 30% of the potential alignment length and fails if it is less than 30%.
[Figure series: candidate alignments between in silico CMAPs +1, +2, -3, +4 and BNG CMAP 1 and BNG CMAP 2, highlighted one per slide]
BNG CMAP 1 and + in silico CMAP 1: passes
BNG CMAP 1 and + in silico CMAP 2: passes
BNG CMAP 2 and - in silico CMAP 2: passes
BNG CMAP 2 and - in silico CMAP 2: fails
BNG CMAP 2 and + in silico CMAP 2: fails
BNG CMAP 2 and - in silico CMAP 3: passes
BNG CMAP 2 and - in silico CMAP 3: fails
BNG CMAP 2 and + in silico CMAP 4: passes
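The 30% rule can be written as a single comparison. The sketch below assumes the alignment length and the potential alignment length have already been computed for a candidate alignment; how stitch.pl derives the potential length is not shown on the slides, so it is taken as an input here. The function and parameter names are hypothetical.

```python
MIN_FRACTION_ALIGNED = 0.30  # "greater than 30%" on the slides


def passes_length_filter(alignment_length, potential_length,
                         min_fraction=MIN_FRACTION_ALIGNED):
    """Return True if the aligned fraction of the potential length exceeds the threshold."""
    if potential_length <= 0:
        raise ValueError("potential_length must be positive")
    return alignment_length / potential_length > min_fraction


# Made-up lengths in bp: a long alignment passes, a short local one fails.
print(passes_length_filter(400_000, 1_000_000))
print(passes_length_filter(200_000, 1_000_000))
```

Comparing against the potential length rather than against a fixed number of base pairs is what separates global alignments from short local matches on long maps.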
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
High-quality scaffolding alignments are filtered for the longest and highest-confidence alignment for each in silico CMAP.
Passing alignments are used to super-scaffold.
[Figure: in silico CMAPs +1, +2, -3 and +4 aligned to BNG CMAP 1 and BNG CMAP 2, before and after filtering]
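Keeping one alignment per in silico CMAP could look like the sketch below. The slides say "longest and highest confidence" without saying how the two criteria are combined, so ranking by length first and confidence second is an assumption, as are the `Alignment` record and the example values.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one filtered alignment."""
    in_silico_id: int
    bng_id: int
    length_bp: int
    confidence: float


def best_alignment_per_in_silico(alignments):
    """Keep one alignment per in silico CMAP: longest first, then highest confidence."""
    best = {}
    for aln in alignments:
        current = best.get(aln.in_silico_id)
        # Assumed ranking: compare (length, confidence) tuples.
        if current is None or (aln.length_bp, aln.confidence) > (
            current.length_bp, current.confidence
        ):
            best[aln.in_silico_id] = aln
    return best


# Made-up candidates: in silico CMAP 2 aligns to both BNG maps.
candidates = [
    Alignment(1, 1, 600_000, 25.0),
    Alignment(2, 1, 300_000, 14.0),
    Alignment(2, 2, 450_000, 18.0),
]
chosen = best_alignment_per_in_silico(candidates)
print({k: v.bng_id for k, v in chosen.items()})
```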
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
Stitch is iterated and additional super-scaffolding alignments are found, until all super-scaffolds are joined.
Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps.
[Figure: in silico CMAPs +1, +2, -3 and +4 joined across BNG CMAP 1 and BNG CMAP 2]
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
[Figure: in silico CMAPs +1, +2, -3 and +4 placed along BNG CMAP 1 and BNG CMAP 2]
If a gap length is estimated to be negative, the gap is represented by a 100 bp filler.
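The negative-gap rule can be sketched as below. The 100 bp value is from the slide. The helper names are hypothetical, and filling the gap with "N" characters is an assumption based on common FASTA practice rather than something the slides state.

```python
FILLER_GAP_BP = 100  # filler length used when the estimated gap is negative


def gap_to_write(estimated_gap_bp):
    """Return the gap length to write between two joined scaffolds."""
    return FILLER_GAP_BP if estimated_gap_bp < 0 else estimated_gap_bp


def join_with_gap(left_seq, right_seq, estimated_gap_bp):
    """Join two scaffold sequences with a run of N (assumed gap character)."""
    return left_seq + "N" * gap_to_write(estimated_gap_bp) + right_seq


print(gap_to_write(-5_000))   # a negative estimate becomes the filler
print(gap_to_write(2_500))    # a positive estimate is kept
print(len(join_with_gap("ACGT", "TTGA", -1)))
```

The filler only marks the join. It carries no information about the true distance, so the original negative estimate is worth keeping in a separate report.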
Gap lengths 
[Figure: histogram titled "Distribution of gap lengths for automated output"; x-axis: gap length (bp), -1,500,000 to 1,000,000; y-axis: count, 0 to 20; negative and positive gap lengths marked separately]
In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 bp).
In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 bp).
Negative gap lengths 
The longest negative gap length comes from a BNG consensus map joining in silico CMAPs 23 and 136.
[Figure: scaffolds 22, 23, 129, 136 and 137 aligned to BNG consensus maps]
Note: Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together) because it is supported by genetic maps, but we should check these assemblies. In the bottom alignment of 136, is a large section of BNG map 32 (which joins 23 to 136) a duplicate in the BNG assembly?
Negative gap lengths 
Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run.
Negative gap lengths
Two new super-scaffolds were created and the sequence similarity is being evaluated.
[Figure: ChLG 2 super-scaffolds and ChLG 2 scaffolds aligned to BNG consensus maps; min confidence 10; scaffold labels shown: U, 18, 14, 16, 19 to 28, 30 and 127 to 145]
Note: scaffold_133 aligns but is not visible in IrysView? Why is Super_scaffold_65 backwards?
Gap lengths 
[Figure: histogram titled "Distribution of gap lengths for automated output"; x-axis: gap length (bp), -1,500,000 to 1,000,000; y-axis: count, 0 to 20; negative and positive gap lengths marked separately]
This negative gap also indicated a potential assembly issue.
Negative gap lengths 
This negative gap length comes from a BNG consensus map joining in silico CMAPs 81, 102 and 103.
Half of scaffold_81 aligns with ChLG 7.
Negative gap lengths 
Half of scaffold_81 aligns with ChLG 7.
[Figure: scaffolds 79 to 83 aligned to BNG consensus maps]
Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run.
The BNG maps suggest a mis-assembly of in silico CMAP 81 at the sequence level.
Gap lengths
[Figure: histogram titled "Distribution of gap lengths for automated output"; x-axis: gap length (bp), -1,500,000 to 1,000,000; y-axis: count, 0 to 20; region below -40,000 bp shaded]
All strongly negative gap lengths (below -40,000 bp, shaded) were independently flagged as potential sequence mis-assemblies to be checked at the sequence level.
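Flagging the shaded region amounts to a simple cutoff on the estimated gap length. The sketch below uses the -40,000 bp value from the slide; the function name, the (join name, gap length) tuple layout and the example joins are hypothetical.

```python
SUSPECT_GAP_BP = -40_000  # gaps below this value fall in the shaded region


def flag_suspect_gaps(gaps):
    """Return the (join_name, gap_bp) pairs whose estimated gap is below the cutoff."""
    return [(name, gap_bp) for name, gap_bp in gaps if gap_bp < SUSPECT_GAP_BP]


# Made-up joins, not Tribolium results.
example_gaps = [
    ("join_a", 12_000),
    ("join_b", -3_500),     # mildly negative: written as a 100 bp filler, not flagged
    ("join_c", -250_000),   # strongly negative: candidate mis-assembly
]
print(flag_suspect_gaps(example_gaps))
```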
Gap lengths
[Figure: histogram titled "Distribution of gap lengths for automated output"; x-axis: gap length (bp), -1,500,000 to 1,000,000; y-axis: count, 0 to 20; region of strongly negative gap lengths shaded]
All gaps from the shaded region were also manually rejected, and stitch.pl was rerun without them for the current super-scaffolded assembly.
We suspect that strongly negative gap lengths may be useful in locating sequence mis-assemblies.
Tribolium super-scaffolds 
Input file              N50 (Mb)   Number of contigs   Cumulative length (Mb)
genome FASTA            1.16       2240                160.74
super-scaffold FASTA    4.46       2150                165.92
N50 of the super-scaffolded genome was ~4 times greater than the original.
Super-scaffolds tend to agree with the Tribolium genetic map.
Tribolium super-scaffolds 
Input file              N50 (Mb)   Number of contigs   Cumulative length (Mb)
genome FASTA            1.16       2240                160.74
super-scaffold FASTA    4.46       2150                165.92
For Tribolium:
first minimum percent aligned = 30%
first minimum confidence = 13
second minimum percent aligned = 90%
second minimum confidence = 8
Lower quality alignments were manually selected if the genetic map also supported the order.
Complex scaffolds were broken manually for sequence-level evaluation.
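One way to read the two pairs of thresholds is that an alignment is kept if it meets either pair: a moderate percent aligned with high confidence, or a high percent aligned with lower confidence. That either/or logic is an assumption, as are the function and parameter names and the use of inclusive comparisons for the "minimum" values; the defaults are the Tribolium values listed above.

```python
def passes_stitch_filters(percent_aligned, confidence,
                          first_min_percent=30.0, first_min_conf=13.0,
                          second_min_percent=90.0, second_min_conf=8.0):
    """Return True if the alignment meets the first or the second threshold pair."""
    first = percent_aligned >= first_min_percent and confidence >= first_min_conf
    second = percent_aligned >= second_min_percent and confidence >= second_min_conf
    return first or second


print(passes_stitch_filters(35.0, 15.0))  # passes the first pair
print(passes_stitch_filters(95.0, 9.0))   # passes the second pair only
print(passes_stitch_filters(50.0, 9.0))   # passes neither
```

The second pair lets through alignments that cover nearly all of the potential length even when their confidence score is modest, which tends to matter for short scaffolds with few labels.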
Tribolium super-scaffolds 
ChLG X was reduced from 13 scaffolds to 2, with one scaffold being moved to ChLG 3.
From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super-scaffold.
[Figure: ChLG X super-scaffold and ChLG X scaffolds (U, 3, 4, 5, 6, 7, U, 8, 9, 10, 11, 12, 13) aligned to BNG consensus maps; min confidence 10]
Tribolium super-scaffolds 
The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3.
[Figure: ChLG 3 super-scaffolds and ChLG 3 scaffolds (32, 33, 34, 35, 36, 2, 37, 38, 39, 40, 41, 42 and 51, U, 43, 45, 44, 46) aligned to BNG consensus maps; min confidence 10]
Tribolium super-scaffolds 
Two unplaced scaffolds aligned to ChLG X.
[Figure: ChLG X super-scaffold and ChLG X scaffolds (U, 3, 4, 5, 6, 7, U, 8, 9, 10, 11, 12, 13) aligned to BNG consensus maps; min confidence 10]
Tribolium super-scaffolds 
The 4% redundancy in alignment may come from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map).
[Figure: ChLG X super-scaffold and ChLG X scaffolds (U, 3, 4, 5, 6, 7, U, 8, 9, 10, 11, 12, 13) aligned to BNG consensus maps; min confidence 10]
Tribolium super-scaffolds overlapping BNG cmap 
[Figure: ChLG X super-scaffold and ChLG X scaffolds (U, 3, 4, 5, 6, 7, U, 8, 9, 10, 11, 12, 13) overlapping BNG consensus maps; min confidence 10]
Tribolium super-scaffolds
For ChLG 9, 21 scaffolds were reduced to 9.
[Figure: ChLG 9 super-scaffolds and ChLG 9 scaffolds (128, 130, 131, 133, 134, 132, 129, 135, 127, 136, 137, 139, 138, 140, 141, 142, 143, 144, 145) aligned to BNG consensus maps; min confidence 10]
Tribolium super-scaffolds
For ChLG 5, 17 scaffolds were reduced to 4.
[Figure: ChLG 5 super-scaffolds and ChLG 5 scaffolds (69, 68, 70, 71, 72, 73, 74, U, 75, 76, 77, 78, 79, 80, 81, 82, 83) aligned to BNG consensus maps; min confidence 10]
Acknowledgements 
K-INBRE Bioinformatics Core
Susan Brown - PI
Nic Herndon - script development
Nanyan Lu - manual editing
Michelle Coleman - extractions and running the Irys
Zachary Sliefert - metric summaries

Bionano Genomics
Ernest Lam - assembly pipeline best practices assistance
Weiping Wang - assistance with data formats
Palak Sheth - collaboration to standardize analysis

Script availability
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG

Bng presentation draft

  • 1. Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules ! Jennifer Shelton 2014
  • 2. Assembly Pipeline 3) use sequence reference to adjust molecule stretch for each scan
  • 3. Assembly Pipeline In recent datasets when SNR is low and alignment is good we see a spike in bases per pixel (bpp) in the first scan, a plateau and a lower plateau First scan in a flow cell
  • 4. Assembly Pipeline 5) Use sequence reference to determine assembly noise parameters. Estimated genome size is used to set the p-value threshold.
  • 5. Assembly Pipeline 6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies.
  • 6. Current Tribolium sequence-based assembly Input file N50 (Mb) Number of Contigs Cumulative Length (Mb) Genome FASTA 1.16 2240 160.74 in silico CMAP from FASTA 1.20 223 152.53 223 scaffolds from the sequence-based assembly were longer than 20 (kb) with more than 5 labels and were converted into in silico CMAPs
  • 7. Assembly Results Input file N50 (Mb) Number of Contigs Cumulative Length (Mb) Genome FASTA 1.16 2240 160.74 in silico CMAP from FASTA 1.20 223 152.53 CMAP from assembled BNG molecules (BNG CMAP) 1.35 216 200.47 BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly ! The estimated size of the Tribolium genome is ~200 (Mb)
  • 8. Simplest XMAP alignment description 1 (Mb) 1.1 (Mb) 1.1 (Mb) 1.3 (Mb) Breadth of alignment coverage for in silico CMAP: 2.1 (Mb) Total alignment length for in silico CMAP: 2.1 (Mb) ! Breadth of alignment coverage for BNG CMAP: 2.4 (Mb) Total alignment length for BNG CMAP: 2.4 (Mb) in silico CMAP from genome FASTA CMAP from assembled molecules in silico CMAP 1 in silico CMAP 2 BNG CMAP 1 BNG CMAP 2
  • 9. Complex XMAP alignment description 1 (Mb) in silico CMAP 1 BNG CMAP 1 BNG CMAP 2 1.1 (Mb) 1.3 (Mb) Breadth of alignment coverage for in silico CMAP: 1 (Mb) Total alignment length for in silico CMAP: 2 (Mb) ! Breadth of alignment coverage for BNG CMAP: 2.4 (Mb) Total alignment length for BNG CMAP: 2.4 (Mb) in silico CMAP from genome FASTA CMAP from assembled molecules
  • 10. Alignment of CMAPs 1 (Mb) in silico CMAP 1 BNG CMAP 1 BNG CMAP 2 1.1 (Mb) 1.3 (Mb) Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies ! In this example differences between "breadth" and "total" length could be due to: ! Duplications in sample molecules were extracted from Assembly of alternate haplotypes Mis-assembly creating redundant contigs Collapsed repeat in sequence assembly in silico CMAP from genome FASTA CMAP from assembled molecules
  • 11. Alignment of BNG assembly to reference genome CMAP name Breadth of alignment coverage for CMAP (Mb) Length of total alignment for CMAP (Mb) Percent of CMAP aligned in silico CMAP from FASTA 124.04 132.40 81 CMAP from assembled BNG molecules (BNG CMAP) 131.64 132.34 67 Close to 4% of the alignment of the in silico CMAP appears to be redundant ! Overall 81% of the in silico CMAP aligns to the BNG consensus map
  • 12. ChLG 9 super! Alignment of BNG assembly to reference genome scaffold BNG consensus maps ChLG 9! scaffolds 130 131 133 134 132 129 135 127 136 137 BNG consensus Typically where redundant alignments occur two BNG consensus maps aligned suggesting they represent haplotypes although this has not been verified maps
  • 13. Tribolium super-scaffolds overlapping BNG cmap ChLG 9 super! scaffold BNG consensus maps ChLG 9! scaffolds 128 130 131 133 134 132 BNG consensus maps
  • 14. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds + in silico CMAP 1 + in silico CMAP 4 Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner in silico CMAP aligned as reference + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2
  • 15. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner in silico CMAP aligned as reference alignment is inverted and used as input for stitch + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3
  • 16. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner in silico CMAP aligned as reference alignment is inverted and used as input for stitch + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 4 alignments are filtered based on alignment length relative total possible alignment length and confidence + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 1
  • 17. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 + in silico CMAP 1 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length
  • 18. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 scaffolds + in silico CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length
  • 19. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 - in silico CMAP 2 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length
  • 20. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 - in silico CMAP 2 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment fails because the alignment length is less than 30% of the potential alignment length
  • 21. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 2 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment fails because the alignment length is less than 30% of the potential alignment length
  • 22. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length - in silico CMAP 3
  • 23. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. [Figure: BNG CMAP 1 and BNG CMAP 2 with aligned in silico CMAPs: + in silico CMAP 1, + in silico CMAP 4, + in silico CMAP 2, - in silico CMAP 3. Labels for the alignment being checked: - in silico CMAP 3, BNG CMAP 2.] Stitch.pl checks alignment length against potential alignment length to find relevant global rather than local alignments. This alignment fails because the alignment length is less than 30% of the potential alignment length.
  • 24. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. [Figure: BNG CMAP 1 and BNG CMAP 2 with aligned in silico CMAPs: + in silico CMAP 1, + in silico CMAP 4, + in silico CMAP 2, - in silico CMAP 3. Labels for the alignment being checked: + in silico CMAP 4, BNG CMAP 2.] Stitch.pl checks alignment length against potential alignment length to find relevant global rather than local alignments. This alignment passes because the alignment length is greater than 30% of the potential alignment length.
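The percent-aligned check described on slides 18-24 could be sketched in Python roughly as follows. The slides give only the 30% threshold; they do not say how stitch.pl computes the potential alignment length. The definition used here (aligned span plus the shorter overhang on each side, for a + orientation alignment) and all function and parameter names are assumptions for illustration only.

```python
def potential_alignment_length(q_start, q_end, q_len, r_start, r_end, r_len):
    """Assumed definition (not taken from stitch.pl): the overlap the two maps
    would share if the alignment extended to the nearest map end on each side.
    Coordinates are in bp and assume a + orientation alignment."""
    aligned = q_end - q_start
    left = min(q_start, r_start)                # shorter left overhang
    right = min(q_len - q_end, r_len - r_end)   # shorter right overhang
    return aligned + left + right


def passes_percent_aligned(q_start, q_end, q_len, r_start, r_end, r_len,
                           min_percent=30.0):
    """True if the aligned length is greater than min_percent of the
    potential alignment length (30% is the value shown on the slides)."""
    aligned = q_end - q_start
    potential = potential_alignment_length(q_start, q_end, q_len,
                                           r_start, r_end, r_len)
    return 100.0 * aligned / potential > min_percent
```

Under this assumed definition, an alignment that runs to the end of one map on each side scores 100%, while a short match in the middle of two long maps scores low and is treated as local.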
  • 25. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. [Figure: BNG CMAP 1 and BNG CMAP 2 with aligned in silico CMAPs: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3, + in silico CMAP 4.] High quality scaffolding alignments...
  • 26. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. High quality scaffolding alignments are filtered for the longest and highest confidence alignment for each in silico CMAP. [Figure: BNG CMAP 1 and BNG CMAP 2 with aligned in silico CMAPs before and after filtering: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3, + in silico CMAP 4.]
  • 27. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. High quality scaffolding alignments are filtered for the longest and highest confidence alignment for each in silico CMAP. Passing alignments are used to super-scaffold. [Figure: BNG CMAP 1 and BNG CMAP 2 with aligned in silico CMAPs before filtering, after filtering, and as super-scaffolds: + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3, + in silico CMAP 4.]
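The per-map filter on slides 26-27 might look something like the sketch below. The slides say only that the longest and highest confidence alignment is kept for each in silico CMAP; ranking by aligned length first and using confidence as the tie-breaker is an assumption, as are the dictionary keys.

```python
def best_alignment_per_map(alignments):
    """Keep one alignment per in silico CMAP.

    Each alignment is a dict with hypothetical keys 'in_silico_id',
    'aligned_length' and 'confidence'. Ranking by length and then by
    confidence is an assumed ordering, not confirmed by the slides.
    """
    best = {}
    for aln in alignments:
        rank = (aln["aligned_length"], aln["confidence"])
        current = best.get(aln["in_silico_id"])
        if current is None or rank > (current["aligned_length"],
                                      current["confidence"]):
            best[aln["in_silico_id"]] = aln
    return best
```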
  • 28. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. Stitch is iterated and additional super-scaffolding alignments are found. Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps. [Figure: BNG CMAP 1 and BNG CMAP 2 with + in silico CMAP 2, - in silico CMAP 3, + in silico CMAP 1, + in silico CMAP 4.]
  • 29. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. Stitch is iterated and additional super-scaffolding alignments are found, until all super scaffolds are joined. Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps. [Figure: BNG CMAP 1 and BNG CMAP 2 joined through + in silico CMAP 2, - in silico CMAP 3, + in silico CMAP 1, + in silico CMAP 4.]
  • 30. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds. If gap length is estimated to be negative, gaps are represented by 100 (bp) fillers. [Figure: BNG CMAP 1 and BNG CMAP 2 with - in silico CMAP 3, + in silico CMAP 2, + in silico CMAP 4, + in silico CMAP 1.]
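The gap rule on slide 30 can be illustrated with a small Python sketch. The 100 bp filler for negative estimates comes from the slide. The way the gap is estimated here (projecting the unaligned ends of two + orientation in silico maps onto the BNG map coordinates), the use of "N" as the filler character, and the dictionary keys are all assumptions.

```python
FILLER_BP = 100  # from the slide: negative gaps are represented by 100 bp fillers


def estimated_gap(left, right):
    """Assumed estimate: distance on the BNG map between the projected end of
    the left in silico map and the projected start of the right one.
    Both maps are assumed to align in + orientation; values are integer bp."""
    left_end = left["ref_end"] + (left["q_len"] - left["q_end"])
    right_start = right["ref_start"] - right["q_start"]
    return right_start - left_end


def gap_filler(gap_bp):
    """Return a run of N's for the gap; negative estimates get the fixed filler."""
    length = gap_bp if gap_bp >= 0 else FILLER_BP
    return "N" * length
```

A negative estimate means the two sequence scaffolds appear to overlap on the BNG map, which is why the later slides treat very large negative values as a hint of mis-assembly.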
  • 31. Gap lengths. In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp)). In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp)). [Figure: Distribution of gap lengths for automated output; x-axis: Gap length (bp), y-axis: Count; series: negative gap lengths, positive gap lengths.]
  • 33. Negative gap lengths. The longest negative gap length is from a BNG consensus map joining in silico 23 and 136. Slide annotation: Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together, because it is supported by genetic maps), but we should check these assemblies. In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly? [Figure: alignments with scaffold labels 22, 23, 129, 136, 137.]
  • 34. Negative gap lengths. Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run. Slide annotation: Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together, because it is supported by genetic maps), but we should check these assemblies. In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly? [Figure: alignments with scaffold labels 22, 23, 129, 136, 137.]
  • 35. Negative gap lengths. Two new super scaffolds were created and the sequence similarity is being evaluated. Slide annotation: (scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? [Figure (min confidence 10): two panels, each showing a ChLG 2 super scaffold, BNG consensus maps and ChLG 2 scaffolds; track labels span 127 to 145 and 14 to 30, plus U (unplaced).]
  • 36. Gap lengths. This negative alignment also indicated a potential assembly issue. [Figure: Distribution of gap lengths for automated output; x-axis: Gap length (bp), y-axis: Count; series: negative gap lengths, positive gap lengths.]
  • 37. Negative gap lengths. This negative gap length is from a BNG consensus map joining in silico 81, 102 and 103. Half of scaffold_81 aligns with ChLG7.
  • 38. Negative gap lengths. Half of scaffold_81 aligns with ChLG7. Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run. The BNG maps suggest a mis-assembly of in silico 81 at the sequence level. [Figure: alignments with scaffold labels 79, 80, 81, 82, 83.]
  • 39. Gap lengths. All extremely small negative gap lengths, < -40,000 (bp) (shaded), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level. [Figure: Distribution of gap lengths for automated output; x-axis: Gap length (bp), y-axis: Count; series: negative gap lengths, positive gap lengths.]
  • 40. Gap lengths. All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly. We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies. [Figure: Distribution of gap lengths for automated output; x-axis: Gap length (bp), y-axis: Count; series: negative gap lengths, positive gap lengths.]
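The flagging rule from slides 39-40 amounts to a simple partition of the estimated gap lengths, sketched below in Python. The -40,000 bp cutoff is from the slides; the function name, the category names and the example values are hypothetical.

```python
MISASSEMBLY_CUTOFF_BP = -40_000  # from the slides: gaps < -40,000 bp are flagged


def classify_gaps(gap_lengths_bp):
    """Split estimated gap lengths into known (>= 0), negative, and flagged
    (below the cutoff, to be checked at the sequence level)."""
    groups = {"known": [], "negative": [], "flagged": []}
    for gap in gap_lengths_bp:
        if gap < MISASSEMBLY_CUTOFF_BP:
            groups["flagged"].append(gap)
        elif gap < 0:
            groups["negative"].append(gap)
        else:
            groups["known"].append(gap)
    return groups
```

In this sketch the flagged gaps are the ones that would be rejected before rerunning the stitch step, while the remaining negative gaps would still receive the 100 bp filler.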
  • 41. Tribolium super-scaffolds. Table (input file: N50 (Mb), number of contigs, cumulative length (Mb)): genome FASTA: 1.16, 2240, 160.74; super-scaffold FASTA: 4.46, 2150, 165.92. N50 of the super-scaffolded genome was ~4 times greater than the original. Super-scaffolds tend to agree with the Tribolium genetic map.
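For readers unfamiliar with the N50 metric reported in the table, a minimal Python sketch of the standard definition is given here: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the cumulative length. The function name and the example lengths are illustrative and are not taken from the Tribolium data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the total length
            return length
    raise ValueError("lengths must be non-empty")
```

Joining scaffolds into fewer, longer super-scaffolds shifts more of the cumulative length into long sequences, which is why the N50 rises even though the total length changes only slightly.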
  • 42. Tribolium super-scaffolds. Table (input file: N50 (Mb), number of contigs, cumulative length (Mb)): genome FASTA: 1.16, 2240, 160.74; super-scaffold FASTA: 4.46, 2150, 165.92. For Tribolium: first minimum percent aligned = 30%, first minimum confidence = 13; second minimum percent aligned = 90%, second minimum confidence = 8. Lower quality alignments were manually selected if the genetic map also supported the order. Complex scaffolds were broken manually for sequence-level evaluation.
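The two pairs of thresholds on slide 42 could be applied as in the Python sketch below. The four values are from the slide. The slide does not state how the two pairs are combined; treating an alignment as passing when it satisfies either pair is an assumption, as are the names used here.

```python
# Threshold values from the slide; the structure and names are hypothetical.
FIRST = {"min_confidence": 13, "min_percent_aligned": 30}
SECOND = {"min_confidence": 8, "min_percent_aligned": 90}


def meets(confidence, percent_aligned, thresholds):
    """True if both the confidence and percent-aligned minimums are met."""
    return (confidence >= thresholds["min_confidence"]
            and percent_aligned >= thresholds["min_percent_aligned"])


def passes_filters(confidence, percent_aligned):
    """Assumed rule: keep an alignment that meets either threshold pair."""
    return (meets(confidence, percent_aligned, FIRST)
            or meets(confidence, percent_aligned, SECOND))
```

Read this way, the first pair accepts confident alignments that cover a modest share of the potential length, and the second accepts lower-confidence alignments only when they are nearly global.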
  • 43. Tribolium super-scaffolds. From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. ChLG X was reduced from 13 scaffolds to 2, with one scaffold being moved to ChLG 3. [Figure (min confidence 10): ChLG X super scaffold, BNG consensus maps and ChLG X scaffolds; scaffold labels U, 3-7, U, 8-13.]
  • 44. Tribolium super-scaffolds. The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3. [Figure (min confidence 10): ChLG 3 super scaffolds, BNG consensus maps and ChLG 3 scaffolds; scaffold labels 32-36, 2, 37-42, 51, U, 43, 45, 44, 46.]
  • 45. Tribolium super-scaffolds. From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. Two unplaced scaffolds aligned to ChLG X. [Figure (min confidence 10): ChLG X super scaffold, BNG consensus maps and ChLG X scaffolds; scaffold labels U, 3-7, U, 8-13.]
  • 46. Tribolium super-scaffolds. From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. 4% redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map). [Figure (min confidence 10): ChLG X super scaffold, BNG consensus maps and ChLG X scaffolds; scaffold labels U, 3-7, U, 8-13.]
  • 47. Tribolium super-scaffolds. From ChLG X, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. [Figure (min confidence 10): ChLG X super scaffold, BNG consensus maps and ChLG X scaffolds, with an overlapping BNG cmap marked; scaffold labels U, 3-7, U, 8-13.]
  • 48. Tribolium super-scaffolds. For ChLG 9, 21 scaffolds were reduced to 9. Slide annotation: ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? [Figure (min confidence 10): ChLG 9 super scaffold, BNG consensus maps and ChLG 9 scaffolds; scaffold labels 128, 130, 131, 133, 134, 132, 129, 135, 127, 136, 137, 139, 138, 140-145.]
  • 49. Tribolium super-scaffolds. For ChLG 5, 17 scaffolds were reduced to 4. [Figure (min confidence 10): ChLG 5 super scaffold, BNG consensus maps and ChLG 5 scaffolds; scaffold labels 69, 68, 70-74, U, 75-83.]
  • 50. Acknowledgements. K-INBRE Bioinformatics Core: Susan Brown - PI; Nic Herndon - script development; Nanyan Lu - manual editing; Michelle Coleman - extractions and running the Irys; Zachary Sliefert - metric summaries. Bionano Genomics: Ernest Lam - assembly pipeline best practices assistance; Weiping Wang - assistance with data formats; Palak Sheth - collaboration to standardize analysis. Script availability: https://github.com/i5K-KINBRE-script-share/Irys-scaffolding. BNG scripts available by request from BNG.