Combining de Bruijn graph, overlap graph and microassembly for de novo genome assembly
A. Alexandrov, S. Kazakov, S. Melnikov, A. Sergushichev, P. Fedotov, F. Tsarev, A. Shalyto
Genome Assembly Algorithms Laboratory
St. Petersburg National Research University of Information Technologies, Mechanics and Optics

Kazan, 23 Nov 2012
Algorithm

Error correction → Quasicontig assembly → Initial contig assembly → Contig microassembly → Scaffolding

Graphs used along the pipeline: de Bruijn graph, overlap graph.
Error correction
• K-mers – substrings of length k.
• “Trusted” and “untrusted” k-mers.
• Replace “untrusted” k-mers with the
  “trusted” ones.
• If the k-mers do not all fit into memory:
   • Divide them into buckets.
   • Process the buckets independently.
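
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The slide does not define "trusted", so the sketch assumes a k-mer is trusted when it occurs at least min_count times, that buckets are formed by k-mer prefix, and that an untrusted k-mer is replaced only when exactly one trusted k-mer differs from it by a single substitution. The names trusted_kmers, correct_kmer, min_count and prefix_len are hypothetical.

```python
from collections import Counter

def kmers(read, k):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def trusted_kmers(reads, k, min_count, prefix_len=1):
    """Collect k-mers seen at least min_count times (assumed definition of 'trusted').

    K-mers are split into buckets by prefix, and each bucket is counted
    independently, so only one bucket's counts are held at a time.
    """
    prefixes = sorted({km[:prefix_len] for read in reads for km in kmers(read, k)})
    trusted = set()
    for prefix in prefixes:
        bucket = Counter(
            km for read in reads for km in kmers(read, k) if km.startswith(prefix)
        )
        trusted.update(km for km, n in bucket.items() if n >= min_count)
    return trusted

def correct_kmer(kmer, trusted):
    """Replace an untrusted k-mer by its unique trusted one-substitution neighbour."""
    if kmer in trusted:
        return kmer
    candidates = {
        kmer[:i] + base + kmer[i + 1:]
        for i in range(len(kmer))
        for base in "ACGT"
        if base != kmer[i]
    } & trusted
    # Ambiguous or hopeless cases are left unchanged in this sketch.
    return candidates.pop() if len(candidates) == 1 else kmer
```

With three copies of the read ATGCAT and one erroneous read ATGGAT, k = 4 and min_count = 2, the trusted set is {ATGC, TGCA, GCAT}, and correct_kmer maps the untrusted ATGG to ATGC.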


Quasicontig assembly

ATGC     ???     GTCC




ATGC ATGCATGCAGTG GTCC


De Bruijn graph




De Bruijn graph example (1)




De Bruijn graph example (2)

      GTC   TCA   CAT   ATC   TCC

AGT   GTG                     CCA
                        CAC   CAA
GAG   GGA   AGG   CAG   ACA   AAC



Quasicontig assembly
• Build the de Bruijn graph.
• For each pair of reads (r1, r2), find the path between the first k-mer of r1
  and the last k-mer of r2.
• The path has to be of appropriate length.
• The path has to be unique.
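
A minimal Python sketch of the search described above, under stated assumptions: graph nodes are k-mers, an edge joins two k-mers that are consecutive in some read, and "appropriate length" is expressed as bounds on the number of edges (min_edges, max_edges). The function names and the depth-first search are illustrative choices, not the authors' code.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def find_unique_path(graph, start, end, min_edges, max_edges):
    """Return the only start-to-end path within the length bounds, else None."""
    found = []

    def dfs(node, path):
        if len(found) > 1:          # already ambiguous, stop searching
            return
        edges = len(path) - 1
        if node == end and edges >= min_edges:
            found.append(list(path))
        if edges == max_edges:      # length bound reached
            return
        for nxt in sorted(graph.get(node, ())):
            path.append(nxt)
            dfs(nxt, path)
            path.pop()

    dfs(start, [start])
    return found[0] if len(found) == 1 else None

def spell(path):
    """Turn a path of overlapping k-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For a read pair (r1, r2) one would call find_unique_path(graph, r1[:k], r2[-k:], min_edges, max_edges), with the bounds derived from the expected fragment size, and spell the returned path to obtain the quasicontig. When two or more paths fit the bounds, the sketch returns None and the pair is left unresolved.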



De Bruijn graph example (3)




De Bruijn graph example (4)




Unique paths correspond to quasicontigs
Initial contig assembly
• Overlap
  – Suffix array
  – Inexact overlaps
• Layout
  – Overlap graph
• Consensus
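
To make the overlap and layout steps concrete, here is a small Python sketch. It deliberately swaps in a naive quadratic scan for exact suffix-prefix overlaps, whereas the slide names a suffix array and inexact overlaps, so it illustrates only the shape of the resulting overlap graph. The names suffix_prefix_overlap, build_overlap_graph and min_len are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(seqs, min_len):
    """Return {(i, j): overlap length} for every ordered pair with a long enough overlap."""
    graph = {}
    for i, a in enumerate(seqs):
        for j, b in enumerate(seqs):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_len)
                if length:
                    graph[(i, j)] = length
    return graph
```

For the sequences ATGCAG, GCAGTC and AGTCCA with min_len = 3, the graph has two edges, 0 → 1 and 1 → 2, each with overlap 4, which suggests the layout ATGCAG, GCAGTC, AGTCCA.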




Contig microassembly



• There are paired reads that map to different
  contigs.
• There are pairs of reads, one of which maps to
  one of the contigs and the other one maps to the
  gap between the contigs.

Contig microassembly algorithm
• Use Bowtie to find the positions of reads in
  contigs.
• Find all the pairs of contigs connected by many
  reads.
• Build the de Bruijn graph using the reads that
  map to at least one of the chosen contigs.
• Use the quasicontig assembly algorithm to fill
  the gap.
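
The contig-pair selection and read selection steps might look like the sketch below. It does not call Bowtie: it assumes the alignments have already been parsed into pair_hits, a list with one (contig_of_read1, contig_of_read2) tuple per read pair, where None means the read did not map. That input format, the threshold min_links standing in for "many reads", and the choice to keep both mates of a pair when at least one maps are all assumptions made for illustration.

```python
from collections import Counter

def connected_contig_pairs(pair_hits, min_links):
    """Contig pairs linked by at least min_links read pairs mapping to both contigs."""
    links = Counter()
    for left, right in pair_hits:
        if left is not None and right is not None and left != right:
            links[tuple(sorted((left, right)))] += 1
    return {pair for pair, n in links.items() if n >= min_links}

def reads_for_gap(pair_hits, read_pairs, contig_pair):
    """Reads from pairs in which at least one mate maps to one of the chosen contigs."""
    selected = []
    for (left, right), (r1, r2) in zip(pair_hits, read_pairs):
        if left in contig_pair or right in contig_pair:
            # Keep the unmapped mate too: it may come from the gap.
            selected.extend((r1, r2))
    return selected
```

The selected reads would then be used to build a local de Bruijn graph, and the quasicontig path search is applied between the two contigs to fill the gap.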


Results
• E. coli genome – 4.5 million nucleotides.
• SRR001665 library: fragment size – 200 bp, read length – 36 bp, coverage – 160×.
• Before microassembly – 525 contigs, N50 = 17804.
• After microassembly – 247 contigs, N50 = 53720.
• ABySS – 632 contigs, N50 = 64280.
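
For readers unfamiliar with the metric, N50 is the contig length L such that contigs of length at least L together cover at least half of the total assembly length. A short Python sketch of the usual computation follows; the function name n50 is a choice made here.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the longest
    contig downwards, first reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

For toy contig lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs already sum to 11, so N50 is 5.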


Web-service
• http://genome.ifmo.ru/cloud




Acknowledgements
• K. Skryabin and E. Prokhorchuk of the “Bioengineering” center, for the
  introduction to bioinformatics.
• D. Alexeev of NRI PCM, for the invitation to this conference.





