Assembling genomes using ABySS
           dnGASP 2011



            Shaun Jackman
      BC Genome Sciences Centre
         sjackman@bcgsc.ca
        abyss-users@bcgsc.ca
An assembly in two stages
●   Stage 1: Sequence assembly algorithm
●   Stage 2: Paired-end assembly algorithm




Stage 1: Sequence assembly algorithm
●   Load k-mers: load the reads, breaking each read into k-mers
●   Find overlaps: find adjacent k-mers, which overlap by k-1 bases
●   Prune tips: remove k-mers resulting from read errors
●   Pop bubbles: remove variant sequences
●   Generate contigs
Load the reads
●   For each input read of length l, (l - k + 1) k-mers
    are generated by sliding a window of length k
    over the read
●   Each k-mer is a vertex of the de Bruijn graph
●   Two adjacent k-mers are an edge of the de Bruijn graph

      Read (l = 12):
         ATCATACATGAT
      k-mers (k = 9):
         ATCATACAT
          TCATACATG
           CATACATGA
            ATACATGAT
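The sliding window can be sketched in a few lines of Python. This is only an illustration of the (l - k + 1) rule, not ABySS code; the function name kmers is made up for this example.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every k-mer of a read, in order, by sliding a window of length k."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length l yields l - k + 1 k-mers (none if the read is shorter than k).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide: l = 12 and k = 9 give 12 - 9 + 1 = 4 k-mers.
for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```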
De Bruijn Graph
●   A simple graph for k = 5
●   Two reads
        –   GGACATC
        –   GGACAGA
            GGACA -> GACAT -> ACATC
            GGACA -> GACAG -> ACAGA
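A minimal Python sketch of building this small graph from the two reads. It assumes a plain dictionary of successor sets and ignores reverse complements, which a real assembler has to handle; the name de_bruijn_graph is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer]  # touch the key so the vertex exists even with no successors
            if i + k < len(read):
                # The next k-mer overlaps this one by k - 1 bases.
                graph[kmer].add(read[i + 1:i + k + 1])
    return dict(graph)


graph = de_bruijn_graph(["GGACATC", "GGACAGA"], 5)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

GGACA ends up with two successors (GACAG and GACAT), which is the fork drawn on the slide.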
Pruning tips
●   Read errors cause
    tips
●   Pruning tips
    removes the
    erroneous k-mers
    from the assembly
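One way tip pruning could look in Python, as a rough sketch rather than the ABySS implementation. It assumes the graph is a dictionary mapping every vertex to its set of successors, handles only forward tips (branches that end in a vertex with no successors), and uses a hypothetical max_tip_len threshold.

```python
def prune_tips(graph, max_tip_len):
    """Remove dead-end branches of at most max_tip_len vertices.

    graph maps every vertex to the set of its successors.
    """
    graph = {v: set(succ) for v, succ in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succ in graph.items():
        for w in succ:
            preds[w].add(v)

    for end in [v for v in graph if not graph[v]]:
        if end not in graph:
            continue
        branch = [end]
        # Walk backwards while the branch is a simple chain.
        while len(preds[branch[-1]]) == 1:
            parent = next(iter(preds[branch[-1]]))
            if len(graph[parent]) > 1:
                # parent is a fork: the chain is a tip if it is short enough.
                if len(branch) <= max_tip_len:
                    graph[parent].discard(branch[-1])
                    for v in branch:
                        del graph[v]
                break
            branch.append(parent)
            if len(branch) > max_tip_len:
                break  # too long to be a tip
    return graph


# A -> B -> C -> D is the true path; X is a one-vertex tip hanging off B.
example = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": set(), "X": set()}
print(prune_tips(example, max_tip_len=1))
```

The threshold matters: a branch longer than max_tip_len is kept, so the genuine path must be longer than the tips it competes with.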




Popping bubbles
●   Variant sequences cause
    bubbles
●   Popping bubbles removes
    the variant sequence from
    the assembly
●   Repeat sequences with
    small differences also
    cause bubbles
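A simplified Python sketch of bubble popping. It assumes a dictionary of successor sets, a hypothetical coverage table (for example, k-mer counts), and only handles bubbles whose two branches are simple chains of at most max_len vertices; the branch with lower mean coverage is dropped. The rule ABySS actually uses for choosing a branch is not stated on the slide.

```python
def pop_bubbles(graph, coverage, max_len):
    """Remove the lower-coverage branch of each simple two-branch bubble."""
    graph = {v: set(succ) for v, succ in graph.items()}  # work on a copy
    preds = {v: set() for v in graph}
    for v, succ in graph.items():
        for w in succ:
            preds[w].add(v)

    def walk(v):
        """Follow a simple chain from v; return (chain, vertex after the chain)."""
        chain = []
        while len(chain) < max_len and len(preds[v]) == 1 and len(graph[v]) == 1:
            chain.append(v)
            v = next(iter(graph[v]))
        return chain, v

    def mean_coverage(chain):
        return sum(coverage[v] for v in chain) / len(chain)

    for s in list(graph):
        if s not in graph or len(graph[s]) != 2:
            continue
        a, b = sorted(graph[s])
        chain_a, end_a = walk(a)
        chain_b, end_b = walk(b)
        # A bubble: both branches are chains that meet again at the same vertex.
        if chain_a and chain_b and end_a == end_b:
            loser = chain_a if mean_coverage(chain_a) < mean_coverage(chain_b) else chain_b
            graph[s].discard(loser[0])
            preds[end_a].discard(loser[-1])
            for v in loser:
                del graph[v]
    return graph


bubble = {"S": {"A1", "B1"}, "A1": {"A2"}, "A2": {"T"},
          "B1": {"B2"}, "B2": {"T"}, "T": set()}
cov = {"S": 12, "A1": 10, "A2": 10, "B1": 2, "B2": 2, "T": 12}
print(pop_bubbles(bubble, cov, max_len=5))
```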




Assemble contigs
●   Remove ambiguous
    edges
●   Output contigs in
    FASTA format
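A Python sketch of these two steps, under the assumption that an edge is "ambiguous" when it leaves a vertex with several successors or enters a vertex with several predecessors. The helper names and the numeric FASTA identifiers are illustrative only, and pure cycles are ignored.

```python
def contigs_from_graph(graph):
    """Return contig sequences after dropping ambiguous edges.

    graph maps every k-mer to the set of k-mers that follow it.
    """
    indeg = {v: 0 for v in graph}
    for succ in graph.values():
        for w in succ:
            indeg[w] += 1
    # Keep an edge only if it is the sole edge out of v and the sole edge into w.
    nxt = {}
    for v, succ in graph.items():
        if len(succ) == 1:
            (w,) = succ
            if indeg[w] == 1:
                nxt[v] = w
    has_pred = set(nxt.values())
    contigs = []
    for start in sorted(graph):
        if start in has_pred:
            continue  # not the first k-mer of a contig
        seq, v = start, start
        while v in nxt:
            v = nxt[v]
            seq += v[-1]  # each step along the chain adds one base
        contigs.append(seq)
    return contigs


def to_fasta(contigs):
    """Format contigs as FASTA records with numeric identifiers."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(contigs))


graph = {"GGACA": {"GACAT", "GACAG"}, "GACAT": {"ACATC"},
         "GACAG": {"ACAGA"}, "ACATC": set(), "ACAGA": set()}
print(to_fasta(contigs_from_graph(graph)), end="")
```

On the two-read example the fork at GGACA is ambiguous, so the graph splits into three contigs: GGACA and the two branches.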




Stage 2: Paired-end assembly algorithm
●   Align the reads to the contigs of the first stage
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
●   Estimate the distance between contigs using
    the paired reads that align to different contigs




Align the reads to the contigs
                      KAligner
●   Every k-mer in the single-end
    assembly is unique
●   KAligner can map reads with k
    consecutive correct bases
●   ABySS may use other aligners,
    including BWA and bowtie
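Because every k-mer is unique, an exact k-mer lookup is enough to place a read. The Python sketch below shows the idea only; it is not KAligner's code, it ignores reverse complements, and build_index and align are hypothetical names.

```python
def build_index(contigs, k):
    """Map each k-mer to its (contig name, position); k-mers are assumed unique."""
    index = {}
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (name, i)
    return index


def align(read, index, k):
    """Return (contig name, contig position of the read's first base), or None.

    The read maps as soon as any one of its k-mers is found, that is, as soon
    as it contains k consecutive correct bases.
    """
    for i in range(len(read) - k + 1):
        hit = index.get(read[i:i + k])
        if hit is not None:
            name, pos = hit
            return name, pos - i
    return None


contigs = {"c1": "ACGTTGCA", "c2": "GGATCCTA"}
index = build_index(contigs, 4)
print(align("GTTGCA", index, 4))  # exact read
print(align("CTTGCA", index, 4))  # first base is an error, still placed
```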




Empirical fragment-size distribution
                     ParseAligns
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
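A small Python sketch of collecting such a distribution. The input layout, one (contig, start, length) tuple per read, is an assumption made for this example and is not the ParseAligns file format.

```python
from collections import Counter


def fragment_size_histogram(pairs):
    """Count fragment sizes from read pairs whose two reads hit the same contig.

    Each pair is ((contig, start, length), (contig, start, length)); the
    fragment size is measured from the leftmost start to the rightmost end.
    """
    hist = Counter()
    for (c1, s1, l1), (c2, s2, l2) in pairs:
        if c1 != c2:
            continue  # pairs spanning two contigs are used later, for distances
        hist[max(s1 + l1, s2 + l2) - min(s1, s2)] += 1
    return hist


pairs = [
    (("c1", 0, 50), ("c1", 450, 50)),
    (("c1", 100, 50), ("c1", 560, 50)),
    (("c1", 10, 50), ("c1", 460, 50)),
    (("c1", 900, 50), ("c2", 20, 50)),  # different contigs: skipped here
]
print(sorted(fragment_size_histogram(pairs).items()))
```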




Estimate distances between contigs
                     DistanceEst
●   Estimate the distance between contigs using
    the paired reads that align to different contigs

[Figure: contig graph with an estimated distance on each edge: d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3]
Maximum likelihood estimator
                    DistanceEst
●   Use the empirical paired-
    end size distribution
●   Maximize the likelihood
    function
●   Find the most likely
    distance between the two
    contigs
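The idea can be sketched in Python as a brute-force search over candidate distances. This is a simplified model, not the DistanceEst implementation: it assumes each pair contributes a known number of fragment bases lying on the two contigs (called spans here), so that fragment size = span + d, and it gives unseen sizes a small floor probability.

```python
import math


def estimate_distance(hist, spans, d_min, d_max, floor=1e-9):
    """Return the distance d in [d_min, d_max] that maximizes the likelihood.

    hist maps fragment size -> count (the empirical distribution).
    spans holds, for each pair joining the two contigs, the number of
    fragment bases that lie on the contigs themselves.
    """
    total = sum(hist.values())

    def log_likelihood(d):
        # Sizes never observed get a small floor probability instead of zero.
        return sum(math.log(max(hist.get(s + d, 0) / total, floor)) for s in spans)

    return max(range(d_min, d_max + 1), key=log_likelihood)


hist = {498: 1, 499: 3, 500: 6, 501: 3, 502: 1}
print(estimate_distance(hist, spans=[475, 476, 474], d_min=0, d_max=100))
```

With these toy numbers the most likely distance is 25, the value that puts the three implied fragment sizes (500, 501, 499) closest to the peak of the distribution.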



Paired-end algorithm
                   continued...
●   Generate paths: find paths through the contig adjacency graph
    that agree with the distance estimates
●   Merge paths: merge overlapping paths
●   Generate contigs: merge the contigs in these paths and output
    the FASTA file
Find consistent paths
                    SimpleGraph
●   Find paths through the contig adjacency graph
    that agree with the distance estimates




[Figure: a path between two contigs; estimated distance d = 4 ± 3, actual distance = 3]
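A Python sketch of the search, under simplifying assumptions: the gap between two contigs is taken to be the total length of the contigs strictly between them, and overlaps between neighbouring contigs are ignored. The function name and data layout are hypothetical, not SimpleGraph's.

```python
def consistent_paths(adj, length, src, dst, d, err):
    """Return every path src -> ... -> dst whose gap length lies within d ± err.

    adj maps a contig to its successors; length maps a contig to its size.
    """
    found = []

    def search(path, gap):
        if gap > d + err:
            return  # already too long; extending the path cannot help
        for nxt in sorted(adj.get(path[-1], ())):
            if nxt == dst:
                if abs(gap - d) <= err:
                    found.append(path + [nxt])
            elif nxt not in path:
                search(path + [nxt], gap + length[nxt])

    search([src], 0)
    return found


adj = {"A": {"X", "Y"}, "X": {"B"}, "Y": {"Z"}, "Z": {"B"}}
length = {"A": 100, "B": 100, "X": 3, "Y": 20, "Z": 30}
# Estimate d = 4 ± 3: only the route through X (gap 3) agrees with it.
print(consistent_paths(adj, length, "A", "B", d=4, err=3))
```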
Merge overlapping paths
                    MergePaths
●   Merge paths that overlap
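For two paths given as lists of contig identifiers, a merge on a shared end could look like the Python sketch below. It covers only the simplest case, a suffix of one path equal to a prefix of the other, and is not the MergePaths algorithm.

```python
def merge_paths(a, b):
    """Merge path b onto path a if a suffix of a equals a prefix of b.

    Returns the merged path, or None when the paths do not overlap.
    The longest overlap is preferred.
    """
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None


print(merge_paths([1, 2, 3], [2, 3, 4, 5]))
print(merge_paths([1, 2], [3, 4]))
```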




Generate the FASTA output
●   Merge the contigs in these paths
●   Output the FASTA file




    GATTTTTG   GAC GTCTTGATCTT   CAC    GTATTG CTATT

Assembly process
●   Stage 1 completed in 3.5 hours
        –   Used 72 processors on six machines
        –   Peak memory usage of 180 GB of RAM
●   Stage 2 completed in 9 hours
        –   Used 12 processors on one machine
        –   Peak memory usage of 48 GB of RAM
●   Assembly parameters: k=64 s=200 n=10
Assembly results
          Level 1: 500-bp paired-end reads
●   Assembled half the genome in 7,676 contigs
    larger than the N50 of 50,612 bp
●   Assembled 1.81 Gbp in 170,407 contigs larger
    than 200 bp
●   The largest contig is 1,158,576 bp
●   Removed 1,296,819 variant sequences
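For reference, N50 can be computed from a list of contig lengths as in this short Python sketch. It measures "half" against the total of the lengths given; measuring against a known genome size instead would need that size as an extra input.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```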




Alignments to the reference
●   Aligned the 170,407 contigs longer than 200 bp
●   96.2% align over at least 99% of their length
●   1.2% align over between 90% and 99% of their length
●   2.5% align over less than 90% of their length
Works in progress
●   Replace complex variant sequences with Ns
●   Scaffold over gaps and simple repeat sequences
    using large-fragment mate-pair reads
●   Fill in gaps with sequence using localized
    microassembly
ABySS Publications
         IEEE InfoVis 2009
Acknowledgments
    Supervisors
●   İnanç Birol
●   Steven Jones
    Team
●   Readman Chiu
●   Rod Docking
●   Karen Mungall
●   Jenny Qian