Introduction
                             OLC
        Graph theory and assembly
                  deBruijn - Euler




Genome Assembly Algorithms and Software
   (or... what to do with all that sequence data?)


                   Konstantinos Krampis

                    Asst. Professor, Informatics
                     J. Craig Venter Institute




   George Washington University, Nov. 2nd 2011



            Konstantinos Krampis     Genome Assembly Algorithms and Software

Introduction
    Why do we need genome assembly
    Definitions of genome assembly
OLC
    Overlap
    Layout
    Consensus
    OLC assembly software and publications
Graph theory and assembly
    Definition of a graph
    Graphs and genome assembly
deBruijn - Euler
    An alternative assembly graph
    Constructing a de Bruijn graph from reads
    Genome assembly from de Bruijn graphs
    deBruijn assembly software and publications



Cannot read the complete genome
with the sequencer from one end to
the other!

DNA isolated from a cell is
amplified

Broken into fragments (shearing)

Fragments are "read" with the
sequencer

Use the sequenced fragments (reads)
to reconstruct the genome

Credit: Masahiro Kasahara, Large-Scale Genome Sequence
Processing, Imperial College Press





Assembly: hierarchical process
to reconstruct genome from
reads

Assemble the puzzle of the
genome from the reads:
overlaps connect the pieces

Oversample the genome so that
reads overlap

Key approach: data structure
representing overlaps, and
algorithms operating on that
data structure

Credit: Masahiro Kasahara, Large-Scale Genome Sequence
Processing, Imperial College Press
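
To make "oversample" concrete, here is a rough sketch that computes average coverage as c = N * L / G (number of reads times read length, divided by genome size). The estimate of the uncovered fraction, exp(-c), relies on the idealized assumption that reads start at uniformly random positions, which the slide does not state. The read count, read length and genome size are made-up numbers.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(c):
    """Idealized estimate (uniformly random read starts): a base is missed
    by every read with probability of about exp(-c)."""
    return math.exp(-c)

# Hypothetical numbers: 50,000 reads of 100 bases from a 1,000,000-base genome.
c = coverage(50_000, 100, 1_000_000)
print(f"coverage: {c:.1f}x")
print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.4f}")
```

Higher coverage leaves fewer gaps and gives each read more neighbours to overlap with, at the cost of more sequencing and more overlaps to compute.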




Two major algorithmic paradigms for genome assembly


       Overlap - Layout - Consensus (OLC): well established,
       more powerful method, but more difficult to implement

       OLC: first to be used successfully for complex eukaryotic
       genomes (Drosophila, H. sapiens)

       deBruijn - Euler: newer, easier to implement, problematic
       in complex genomes (for current implementations)







Find Overlaps by aligning
the sequence of the reads

Layout the reads based
on which aligns to which

Get Consensus by joining
all read sequences,
merging overlaps

Sequencer reads in
random direction,
left-to-right or
right-to-left

Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing,
Imperial College Press







Sequence alignment,
all-against-all reads
(Smith-Waterman,
BLAST, other?)

Computationally intensive
but easily parallelizable

Represent read overlap by
connecting with directed
link

First step in creating the
genome assembly graph
(more later)

Credit: Kececioglu and Myers 1995, Algorithmica 13:7-51
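
The overlap step can be sketched as follows. This is a deliberately simplified illustration: it uses exact suffix-prefix matching instead of a real aligner such as Smith-Waterman, ignores sequencing errors and reverse-complement reads, and the function names, the `min_len` threshold and the three toy reads are all assumptions made for the example.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.
    Returns 0 if no exact overlap of at least `min_len` bases exists."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_len=3):
    """All-against-all comparison of reads. Each overlap becomes a
    directed link (a, b) labelled with the overlap length."""
    edges = {}
    for a, b in permutations(reads, 2):
        length = overlap(a, b, min_len)
        if length > 0:
            edges[(a, b)] = length
    return edges

reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]  # toy reads, not real data
for (a, b), weight in overlap_graph(reads).items():
    print(f"{a} -> {b} (overlap {weight})")
```

Every ordered pair of reads is compared independently of the others, which is why the step is expensive but easy to split across many processors.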




Create a consistent linear
(ideally) ordering of the
reads


Remove redundancy, so
no two dovetails leave
the same edge

No containment edge is
followed by a dovetail
edge


Remove cycles, one link
in, one out






Multiple Sequence
Alignment (ClustalW)
algorithms? No
phylogeny here...

Vote for the most abundant
nucleotide for each position

Incorporate read quality data


Create pre-consensus from
high-quality reads, and align
remaining reads to it
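
The voting idea can be sketched in a few lines. The sketch assumes the layout step has already assigned each read an offset, that reads align without gaps, and that every vote counts equally; a real consensus module would also weight votes by read quality, which is not modelled here. The `layout` data is invented for illustration.

```python
from collections import Counter

def consensus(layout):
    """layout: list of (offset, read) pairs, assumed to come from the
    layout step. Returns the most abundant base in each column."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # "N" marks a column that no read covers.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

# Toy layout; the third read carries one sequencing error (T instead of G).
layout = [(0, "ATGGCGT"), (3, "GCGTACG"), (3, "GCTTACG"), (6, "TACGGAT")]
print(consensus(layout))
```

The erroneous base is outvoted two to one by the reads that agree at that position.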





Celera Assembler

   Developed at Celera Genomics for first Drosophila and human genome
   assemblies

   Continued development at J. Craig Venter Inst. as open source project

   http://wgs-assembler.SourceForge.net (Licence: GPL)

   Plenty of wiki (developer + user) documentation, examples, user forums

   Other OLC implementations: Arachne, PCAP, Newbler, Phrap, TIGR
   Assembler





Celera Assembler publications

    Myers et al (2000) A whole-genome assembly of Drosophila
    Levy et al (2007) The diploid genome sequence of an individual human
    Zimin et al (2009) The domestic cow, Bos taurus
    Dalloul et al (2010) The domestic turkey, Meleagris gallopavo
    Lorenzi et al (2010) New assembly of Entamoeba histolytica
    Lawniczak et al (2010) Divergence in Anopheles gambiae
    Jones et al (2011) The marine filamentous cyanobacterium Lyngbya majuscula
    Miller et al The Tasmanian devil, Sarcophilus harrisii
    Prüfer et al The great ape bonobo, Pan paniscus
    Gordon et al The cotton bollworm moth, Helicoverpa


and now a bit of Graph Theory...








Graph G with set of vertices (nodes)
V: {P,T,Q,S,R}

set of edges (links between nodes)
E: {(P,T),(P,Q),(P,S),(Q,T),
(S,T),(Q,S),(S,Q),(Q,R),(R,S)}

walk from P to R: (P,Q),(Q,R)

walk from R to T: (R,S),(S,Q),(Q,T)
or (R,S),(S,T)

walk from R to P: not possible

Credit: Introduction to Graph Theory, Robin J. Wilson
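
The example graph can be written down directly as a vertex set and an edge list. The sketch below treats each pair (u, v) as a directed edge from u to v, which is what "walk from R to P: not possible" implies, and uses a breadth-first search to find a walk; the helper names are illustrative.

```python
from collections import deque

V = {"P", "T", "Q", "S", "R"}
E = [("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"),
     ("S", "T"), ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")]

def adjacency(vertices, edges):
    """Map each vertex to the list of vertices it links to."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
    return adj

def find_walk(adj, start, goal):
    """Breadth-first search. Returns a shortest walk from start to goal
    as a list of edges, or None if goal cannot be reached."""
    came_from = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            walk = []
            while came_from[u] is not None:
                walk.append((came_from[u], u))
                u = came_from[u]
            return walk[::-1]
        for v in adj[u]:
            if v not in came_from:
                came_from[v] = u
                queue.append(v)
    return None

adj = adjacency(V, E)
print(find_walk(adj, "P", "R"))
print(find_walk(adj, "R", "T"))
print(find_walk(adj, "R", "P"))
```

No edge points into P, so the last search returns None.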






Trail: a walk of the graph where
each edge is visited only once

Example Trail: (P,Q), (Q,R),
(R,S), (S,Q), (Q,S), (S,T)

Path: a walk where each vertex
is visited at most once

Example Path: (P,Q), (Q,R),
(R,S), (S,T)
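
The two definitions can be checked mechanically. This sketch repeats the directed edge list of the example graph so that it runs on its own; the function names are illustrative.

```python
E = [("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"),
     ("S", "T"), ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")]

def is_walk(edges, walk):
    """Every step is an edge of the graph and starts where the previous one ended."""
    edge_set = set(edges)
    in_graph = all(step in edge_set for step in walk)
    connected = all(walk[i][1] == walk[i + 1][0] for i in range(len(walk) - 1))
    return in_graph and connected

def is_trail(edges, walk):
    """A walk that uses no edge more than once."""
    return is_walk(edges, walk) and len(set(walk)) == len(walk)

def is_path(edges, walk):
    """A walk that visits no vertex more than once."""
    if not walk or not is_walk(edges, walk):
        return False
    visited = [walk[0][0]] + [v for _, v in walk]
    return len(set(visited)) == len(visited)

trail = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "Q"), ("Q", "S"), ("S", "T")]
path = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
print(is_trail(E, trail), is_path(E, trail))
print(is_trail(E, path), is_path(E, path))
```

The example trail revisits Q and S, so it is a trail but not a path; the example path is both.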







Credit: Saad Mneimneh, CUNY








Represent sequence overlaps as
a graph with weighted edges

SCS (shortest common superstring)
solution: find a Path (visiting every
vertex exactly once) that maximizes
the weight sum

Hamiltonian Cycle or Traveling
Salesman Problem
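
For a handful of reads the search can be done by brute force, which makes the link between path weight and superstring length visible. The sketch assumes edge weights are exact suffix-prefix overlap lengths and that no read is contained in another; it tries every ordering of the reads, so it is only usable on toy inputs like the invented reads below.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def path_weight(order):
    """Sum of overlap weights along consecutive reads in the ordering."""
    return sum(overlap(a, b) for a, b in zip(order, order[1:]))

def spell(order):
    """Merge the reads in the given order, collapsing each overlap."""
    result = order[0]
    for a, b in zip(order, order[1:]):
        result += b[overlap(a, b):]
    return result

def best_path(reads):
    """Exhaustive search over all orderings (factorial time)."""
    return max(permutations(reads), key=path_weight)

reads = ["TACGGAT", "ATGGCGT", "GCGTACG"]  # toy reads, not real data
order = best_path(reads)
print(order, path_weight(order), spell(order))
```

The superstring length equals the total read length minus the path weight, so the heaviest path gives the shortest superstring.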






Which edge to start from?




NO: misses a vertex

NO: misses edge with large weight






YES!: all vertices and edge with large weight






A more realistic version of a read / string overlap graph (C. jejuni)
Credit: Eugene W. Myers Bioinformatics 21:79-85




Computational Complexity

   Solving SCS by searching for a
   Hamiltonian Cycle on a graph is a
   difficult algorithmic problem
   (NP-hard)

   Approximation or greedy
   algorithms can yield 2- to
   4-approximation solutions (no more
   than two to four times the length of
   the optimal, shortest string)

   Transformation of Overlap Graph
   to String Graph leads to a
   polynomial-time solution. No
   assembler implementation yet.

   Polynomial (P): O(n), O(n^2), O(n^3), etc.
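
A minimal sketch of the greedy approach: repeatedly merge the two strings with the largest overlap until one string remains. It assumes exact overlaps, no sequencing errors and no read contained in the middle of another, and the reads are invented. The result is a common superstring but is not guaranteed to be the shortest one.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_scs(reads):
    """Greedy heuristic for the shortest common superstring.
    Expects a non-empty list of reads."""
    strings = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(strings) > 1:
        best_len, best_i, best_j = -1, None, None
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    length = overlap(a, b)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        # Merge the best pair and put the merged string back in the pool.
        merged = strings[best_i] + strings[best_j][best_len:]
        strings = [s for k, s in enumerate(strings) if k not in (best_i, best_j)]
        strings.append(merged)
    return strings[0]

reads = ["TACGGAT", "ATGGCGT", "GCGTACG"]  # toy reads, not real data
print(greedy_scs(reads))
```

Each round scans all pairs, so this naive version takes a polynomial number of overlap computations, in contrast to the factorial cost of trying every ordering.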




Pevzner, Tang and Waterman (2001), An Eulerian path approach
to DNA fragment assembly, PNAS 98:9748-9753.








deBruijn graph: a directed graph representing overlaps between
sequences of symbols
Credit: Wikipedia
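
One common way to build such a graph from reads is sketched below: cut every read into k-mers, use the (k-1)-mers as nodes, and let each distinct k-mer become an edge from its prefix to its suffix. This node and edge convention is an assumption (the slide does not fix one). The sketch also ignores sequencing errors, reverse complements and k-mer multiplicity, and the reads and the choice k = 4 are invented.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one directed edge
    from its first k-1 bases to its last k-1 bases."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # a k-mer shared by several reads is stored once
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]  # toy reads, not real data
graph = de_bruijn(reads, 4)
for node, successors in graph.items():
    print(node, "->", ", ".join(successors))
```

Overlaps between reads never have to be computed explicitly: reads that share a k-mer simply share an edge in the graph.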



In a real genome scenario...




Credit: Flicek and Birney 2009, Nature Methods 6, S6 - S12





Euler’s algorithm




   Using Euler’s algorithm we can find a path that visits each edge of the de
   Bruijn genome assembly graph exactly once, and concatenate the edge
   labels to "spell out" the assembly. Polynomial time!
   Credit: Wikipedia
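
A sketch of the idea using Hierholzer's algorithm, a standard linear-time method for finding an Eulerian path; the slide does not say which procedure it has in mind. The code assumes the graph actually has an Eulerian path and spells the sequence from node labels, which is equivalent to concatenating edge labels. The 3-mers are invented.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm on a directed graph given as (u, v) pairs.
    Assumes an Eulerian path exists; returns the nodes in visiting order."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, d in balance.items() if d == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: node is finished
    return path[::-1]

def spell(path):
    """First node label plus the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])

kmers = ["ATG", "TGC", "GCG", "CGT", "GTG", "TGA"]  # toy 3-mers
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
path = eulerian_path(edges)
print(" -> ".join(path))
print(spell(path))
```

Node TG is entered twice here, and the algorithm splices the loop through it into the final path. With longer repeats a graph can admit several different Eulerian paths, each spelling a different sequence.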







Euler assembler (the very first), Pevzner et al 2001 PNAS
98:9748-9753

Velvet assembler (more user friendly)

Both of these assemblers store the complete graph in computer
memory: 512GB-1024GB for human genomes

At JCVI we have two 1024GB (1TB) RAM servers for assembly

Others: ABySS, YAGA, Contrail-Bio, PASHA: parallel (distributed
memory) assemblers on computer clusters





Thank you!


    contact: kkrampis@jcvi.org

    We hire interns at the J. Craig Venter Institute:
    http://www.jcvi.org/cms/education/internship-program/

    Some of my other projects - Cloud Computing:
    http://tinyurl.com/cloudbiolinux-jcvi
    http://www.cloudbiolinux.org



