Combining "overlap-layout-consensus" and de Bruijn graph
approaches for de novo genome assembly
     Alexey Sergushichev, Anton Alexandrov, Sergey Kazakov,
        Sergey Melnikov, Vladislav Isenbaev, Fedor Tsarev
St. Petersburg State University of IT, Mechanics and Optics, Russia

                      In collaboration with:
          Egor Prokhortchouk and Ekaterina Khrameeva
                 Genoanalytica, Moscow, Russia

      Sequence Mapping and Assembly Assessment Project
                     dnGASP workshop
                 Barcelona, April 5th, 2011
Introduction
• Imagine you have two computers:
  – 24 cores (Intel Xeon 2.40 GHz), 24 GB RAM
  – 24 cores (AMD Opteron 6174 2.2 GHz), 64 GB RAM
• …But you don’t know about the second one ☺
• You are to assemble the genome from the dnGASP contest

Algorithm




Error Correction: Read Truncation
• Scan each part of each paired-end (PE) read from the end until the
  first base with quality below 90%
• Truncate each part of each read at that position
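
A minimal sketch of the truncation step, under two assumptions that the slide does not state: "quality" is taken to mean the probability that a base call is correct (so 90% corresponds to Phred score 10), and the read is walked from its first base and cut at the first base below the threshold. The slide's wording is also compatible with a scan that starts at the 3' end. The names `phred_to_accuracy` and `truncate_read` are hypothetical.

```python
from typing import Sequence


def phred_to_accuracy(q: int) -> float:
    """Probability that a base call is correct, given its Phred score q."""
    return 1.0 - 10.0 ** (-q / 10.0)


def truncate_read(bases: str, phred: Sequence[int], min_accuracy: float = 0.90) -> str:
    """Keep the part of the read that precedes the first low-quality base.

    Assumption: the walk starts at the first base of the read; the cut is
    made at the first base whose accuracy is below min_accuracy.
    """
    for i, q in enumerate(phred):
        if phred_to_accuracy(q) < min_accuracy:
            return bases[:i]
    return bases


read = "ACGTACGT"
quals = [30, 30, 25, 20, 12, 8, 30, 30]  # made-up Phred scores
print(truncate_read(read, quals))
```

In this toy read the sixth base (Phred 8, about 84% accuracy) is the first one below the threshold, so only the first five bases are kept.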




Error Correction: Frequency Analysis
• Consider all 30-character substrings (30-mers) of the
  reads and of their reverse complements
• Count the number of occurrences of each
  of these substrings
  – Occurs rarely – contains an error (is untrusted)
  – Occurs frequently – is trusted
• The threshold is chosen manually for each case
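
The counting step can be sketched as follows. This is an illustration only: it keeps every k-mer as a Python string in one in-memory `Counter`, which would not scale to the data set described in the talk, and it assumes reads contain only the letters A, C, G and T. The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(reads, k=30):
    """Count every k-mer of every read and of its reverse complement."""
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts


# Toy input with k=4 so the result is easy to inspect by hand.
counts = count_kmers(["ACGTT", "CGTTA"], k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```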


Error Correction: Distribution Curve
[Figure: distribution of 30-mer occurrence counts; x-axis from 1 to 47, y-axis from 0 to 3·10^9]

• < 4 occurrences – untrusted
• Other 30-mers – trusted
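
Applying the threshold from this slide is a one-line filter. The sketch below assumes a mapping from k-mer to occurrence count is already available; `split_by_trust` and the toy counts are hypothetical.

```python
def split_by_trust(counts, min_trusted=4):
    """Split k-mers into trusted (count >= min_trusted) and untrusted sets."""
    trusted = {kmer for kmer, c in counts.items() if c >= min_trusted}
    untrusted = {kmer for kmer, c in counts.items() if c < min_trusted}
    return trusted, untrusted


# Made-up counts for four short k-mers.
toy_counts = {"ACGT": 7, "CGTA": 4, "GTAC": 3, "TACG": 1}
trusted, untrusted = split_by_trust(toy_counts)
print(sorted(trusted), sorted(untrusted))
```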
Error Correction: Buckets
• Memory:
  – Each substring is stored as a 64-bit integer
  – Number of occurrences – 32-bit integer
  – ~6·10^9 distinct 30-mers in all PE reads – 72 GB
• Split 30-mers into buckets according to their
  prefixes
• Prefix of length k → 4^k buckets
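
A sketch of the storage arithmetic and of prefix bucketing. A 30-mer fits in a 64-bit integer at 2 bits per base (60 bits), and each entry costs 8 + 4 = 12 bytes. The particular 2-bit code, the function names and the prefix lengths tried below are assumptions; the slide does not say which prefix length was used.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def bucket_of(kmer: str, prefix_len: int) -> int:
    """Bucket index is the code of the prefix: 0 .. 4**prefix_len - 1."""
    return encode(kmer[:prefix_len])


BYTES_PER_ENTRY = 8 + 4        # 64-bit k-mer + 32-bit counter
DISTINCT_30MERS = 6 * 10**9    # approximate figure from the slide
total_gb = DISTINCT_30MERS * BYTES_PER_ENTRY / 10**9
print(f"all 30-mers at once: {total_gb:.0f} GB")

# Average bucket size if 30-mers were spread evenly over the buckets.
for k in (1, 2, 3):
    print(f"prefix length {k}: {4**k} buckets, ~{total_gb / 4**k:.1f} GB each")
```

Real prefixes are not uniformly distributed, so the per-bucket figures are only averages.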


Error Correction
• Process each bucket separately
• Consider some untrusted 30-mer
   – Try to change one base outside its prefix: (30-k)·3 ways
   – If exactly one resulting 30-mer is trusted, fix the corresponding read
• To fix an error in the prefix we could load 3·k more buckets into
  RAM or...
• Not load them – consider the reverse complement of the 30-mer, where the
  prefix error falls into the suffix

                     A G T A C A T


                     A T G T A C T
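
The correction rule and the reverse-complement trick can be sketched on the 7-mer pair shown above. The sketch keeps all trusted k-mers in one in-memory set and assumes that set contains both strands; in the bucketed scheme the reverse complement would instead be handled when its own bucket is processed. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def suffix_fixes(kmer, trusted, prefix_len):
    """Trusted k-mers reachable by changing one base outside the prefix."""
    fixes = []
    for pos in range(prefix_len, len(kmer)):
        for base in "ACGT":
            if base != kmer[pos]:
                candidate = kmer[:pos] + base + kmer[pos + 1:]
                if candidate in trusted:
                    fixes.append(candidate)
    return fixes


def correct(kmer, trusted, prefix_len):
    """Return the unique trusted single-base fix, or None if there is none or several."""
    fixes = set(suffix_fixes(kmer, trusted, prefix_len))
    # A prefix error becomes a suffix error on the reverse complement.
    rc = reverse_complement(kmer)
    fixes |= {reverse_complement(f) for f in suffix_fixes(rc, trusted, prefix_len)}
    return next(iter(fixes)) if len(fixes) == 1 else None


trusted = {"AGTACAT", "ATGTACT"}          # the pair from the slide
print(correct("CGTACAT", trusted, 2))     # error in the prefix
print(correct("AGTACAA", trusted, 2))     # error in the suffix
```

Both calls recover AGTACAT: the first only through the reverse complement, the second directly.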
Error Correction: Results
• Used the machine with 24 cores and 24 GB of
  RAM for 24 hours
• Number of distinct 30-mers:
  – Before: 6 533 327 606
  – After: 3 911 459 530 (~40% less)
• Number of trusted 30-mers:
  – Before: 3 070 814 230
  – After: 3 369 674 264 (~10% more)
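
The rounded percentages can be checked directly from the counts on this slide; `percent_change` is a hypothetical helper.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


distinct = percent_change(6_533_327_606, 3_911_459_530)
trusted = percent_change(3_070_814_230, 3_369_674_264)
print(f"distinct 30-mers: {distinct:+.1f}%")
print(f"trusted 30-mers:  {trusted:+.1f}%")
```

This gives about -40.1% and +9.7%, in line with the "~40% less" and "~10% more" quoted above.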
Quasi-Contig Assembly
• Input: a set of PE reads
• Goal: fill the gap between the two ends of each read
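
[Figure: "From this picture…": a PE read with the gap between its ends unfilled]

Quasi-Contig Assembly
[Figure: "…to this": the same read with the gap filled; the two ends are labeled 114, the sequence AGCT..., and the span ~500]
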
• Construct a de Bruijn graph from the reads
• Find paths between the vertices corresponding to the
  ends of reads, using a brute-force algorithm
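
A minimal sketch of de Bruijn graph construction, with k-mers as vertices and one edge per (k+1)-mer. It uses k=3 on toy reads for readability and a plain dictionary of sets; `build_de_bruijn` is a hypothetical name and the data structure is an assumption, not the one used in the talk.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Each (k+1)-mer in a read adds an edge from its k-prefix to its k-suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]
            graph[edge[:k]].add(edge[1:])
    return graph


graph = build_de_bruijn(["ACGTAC", "GTACGG"], k=3)
for vertex in sorted(graph):
    print(vertex, "->", sorted(graph[vertex]))
```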

T-Services Company
• Overall cluster performance is over 20 TFLOPS,
  based on:
   – 2 x AMD Opteron 6174 «Magny-Cours» 2.2 GHz,
     64 GB RAM DDR3 1333 MHz
   – 2 x Intel Xeon E5410 2.33 GHz, 16 GB RAM
     DDR2 667 MHz
   – 2 x Intel Xeon E5450 3.0 GHz, 16 GB RAM
     DDR2 667 MHz
• Provided exclusive access to a node with 64 GB of
  RAM
Quasi-Contig Assembly Parameters
• Used the machine with 24 cores and 64 GB of
  RAM for 20 hours
• Vertices – 30-mers
• Edges – trusted 31-mers
• Minimal length of quasi-contig – 334
• Maximal length of quasi-contig – 550
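
A brute-force path search with length bounds might look like the sketch below. It is a depth-first enumeration on a hand-written toy graph with 3-mer vertices; with the parameters from this slide one would use 30-mer vertices, `min_len=334` and `max_len=550`. The function names, the `limit` cut-off and the outcome labels are assumptions for illustration.

```python
def find_paths(graph, start, end, min_len, max_len, limit=2):
    """Enumerate sequences spelled by paths from start to end.

    A sequence begins as the start k-mer and grows by one base per edge.
    Only sequences with min_len <= length <= max_len are reported, and the
    search stops once `limit` sequences are found.
    """
    found = []
    stack = [(start, start)]  # (current vertex, sequence spelled so far)
    while stack and len(found) < limit:
        vertex, seq = stack.pop()
        if vertex == end and len(seq) >= min_len:
            found.append(seq)
        if len(seq) < max_len:
            for nxt in graph.get(vertex, ()):
                stack.append((nxt, seq + nxt[-1]))
    return found


def classify(paths):
    if not paths:
        return "no way to restore"
    return "restored" if len(paths) == 1 else "many ways to restore"


# Toy graph with two routes from ACG to TAC (lengths 6 and 7).
toy = {
    "ACG": ["CGT", "CGA"],
    "CGT": ["GTA"], "GTA": ["TAC"],
    "CGA": ["GAT"], "GAT": ["ATA"], "ATA": ["TAC"],
}
print(classify(find_paths(toy, "ACG", "TAC", 6, 6)))
print(classify(find_paths(toy, "ACG", "TAC", 6, 7)))
print(classify(find_paths(toy, "ACG", "GGG", 6, 7)))
```

The length cap also guarantees termination on graphs with cycles, at the cost of exponential work in the worst case.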


Quasi-Contig Assembly Results
• 67% of inserts restored to quasi-contigs
• Remaining inserts:
  – ~27% – many ways to restore
  – ~6% – no way to restore




Quasi-Contig Assembly Results
[Figure: length distributions; pink – insert lengths, blue – quasi-contig lengths; x-axis from 264 to 544 in steps of 8, y-axis from 0 to 1.4·10^-2]
Contig & Scaffold Assembly
• Contig assembly
  – Newbler
  – Used quasi-contigs from 24 files (of 88)
  – 60 hours
• Scaffold assembly
  – ABySS
  – 40 hours per library


Overall Results
               n        mean    N50     max       Sum
Newbler: A     401257   3694    7379    6279498   1.482e9
ABySS: A       422207   4635    12580   6279661   1.492e9
ABySS: B       417403   4808    22788   6279463   1.516e9
ABySS: C       526028   3647    14170   6279463   1.522e9
ABySS: D       580217   3275    8070    6279463   1.525e9
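
For reference, a sketch of how such summary columns are commonly computed from a list of sequence lengths. The slide does not define its columns, so reading them as count, mean length, N50, maximum length and total length, and using the usual N50 definition, are assumptions; `n50` and `summarize` are hypothetical names.

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L cover at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths):
    """Count, mean, N50, maximum and total length of a set of sequences."""
    return {
        "n": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "N50": n50(lengths),
        "max": max(lengths),
        "sum": sum(lengths),
    }


# Made-up lengths, not data from the talk.
print(summarize([2, 2, 2, 3, 3, 4, 8, 8]))
```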


Work in Progress
• Develop a software module to replace
  Newbler (contig assembly from quasi-contigs)
• Develop a software module to replace
  ABySS for scaffold assembly
• Improve the quality of quasi-contig assembly
• Reduce RAM requirements

Questions?





Talk at dnGASP workshop, April 5, 2011
