How to sequence a large eukaryotic genome, and how we sequenced the cod genome. Lex Nederbragt, Norwegian High-Throughput Sequencing Centre (NSC) and Centre for Ecological and Evolutionary Synthesis (CEES)
How to sequence a large eukaryotic genome
What is a genome assembly? A hierarchical data structure that maps the sequence data to a putative reconstruction of the target (Miller et al. 2010, Genomics 95 (6): 315-327)
Hierarchical structure
Sequence data: reads. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Reads! http://www.sciencephoto.com/media/210915/enlarge
Contigs: building contigs
Contigs: building contigs. Figure labels: repeat copy 1, repeat copy 2, collapsed repeat consensus. Contig orientation? Contig order? http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs: another read type; mate pair reads come from (much) longer fragments. Figure labels: repeat copy 1, repeat copy 2
Scaffolds: ordered, oriented contigs. Figure labels: mate pairs, contigs, gap size estimate
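The gap size estimate named on the scaffold slide can be illustrated with a small sketch. It assumes a simplified model that is not taken from the slides: each mate pair comes from a fragment of known expected length, one read maps to the upstream contig and its mate to the downstream contig, and the gap is the part of the fragment that neither contig accounts for. The names `MatePairLink`, `gap_from_link` and `estimate_gap` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class MatePairLink:
    """One mate pair linking contig A to contig B (hypothetical record layout)."""
    fragment_length: int     # expected library fragment length, in bases
    dist_to_end_a: int       # bases from the start of read 1 to the end of contig A
    dist_from_start_b: int   # bases from the start of contig B to the end of read 2


def gap_from_link(link: MatePairLink) -> int:
    """Gap implied by one mate pair: fragment length minus the parts inside contigs."""
    return link.fragment_length - link.dist_to_end_a - link.dist_from_start_b


def estimate_gap(links: list[MatePairLink]) -> float:
    """Combine several links with the median, which tolerates a few outliers."""
    if not links:
        raise ValueError("need at least one mate pair link")
    return median(gap_from_link(link) for link in links)


links = [
    MatePairLink(3000, 1200, 900),   # implies a 900-base gap
    MatePairLink(3000, 500, 1700),   # implies an 800-base gap
    MatePairLink(3000, 2000, 0),     # implies a 1000-base gap
]
print(estimate_gap(links))  # 900
```

Real scaffolders also have to model the spread of fragment lengths within a library; the median is only one simple way to combine several links.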
Hierarchical structure
Algorithms: all are graph-based. Graph theory! Figure labels: Read 1, Read 2, overlap
Algorithms: Hamiltonian path, a path that visits every node of the graph exactly once. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Algorithms: overlap calculation (alignment) is computationally intensive. Figure labels: Read 1, Read 2, overlap
Algorithms: a path through the graph gives a contig. Figure labels: Read 1, Read 2, Read 3, Read 4, overlaps
Greedy extension: the oldest approach. http://www.cbcb.umd.edu/research/assembly_primer.shtml
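Greedy extension, as described in the editor's note on greedy assemblers, can be sketched in a few lines. This is a toy version under stated assumptions: reads are error-free, all come from the same strand, and overlaps are exact suffix-prefix matches. The function names `overlap` and `greedy_assemble` are hypothetical, and the three reads are cut from the example sequence on the Overlap-Layout-Consensus slide.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return seqs  # no overlaps left: what remains are the contigs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]


reads = ["ACGCGATTCA", "ATTCAGGTTA", "GGTTACCACG"]
print(greedy_assemble(reads))  # ['ACGCGATTCAGGTTACCACG']
```

Because each step only looks at the single best overlap, two reads from different copies of a repeat can be merged just as readily as true neighbours, which is the weakness the editor's note points out.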
Overlap-Layout-Consensus: typical for Sanger-type reads; also used by newbler from 454 Life Sciences. Steps: overlap computation; layout (graph simplification); consensus (sequence)
Overlap-Layout-Consensus, overlap phase: k-mer seeds initiate overlap. Example sequence: ACGCGATTCAGGTTACCACG
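The idea of k-mer seeds can be sketched as an index lookup: instead of aligning every read against every other read, only pairs that share at least one k-mer become overlap candidates. This is a minimal illustration and not newbler's actual implementation; it ignores reverse complements and sequencing errors, and `kmers` and `candidate_pairs` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq: str, k: int):
    """Yield every k-mer of `seq`, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def candidate_pairs(reads: list[str], k: int) -> set[tuple[int, int]]:
    """Return index pairs of reads that share at least one k-mer."""
    index: dict[str, set[int]] = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    pairs: set[tuple[int, int]] = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


# Three reads cut from the slide's example sequence ACGCGATTCAGGTTACCACG.
reads = ["ACGCGATTCA", "ATTCAGGTTA", "GGTTACCACG"]
print(sorted(candidate_pairs(reads, k=5)))  # [(0, 1), (1, 2)]
```

Only the candidate pairs would then go on to the expensive alignment step; reads 0 and 2 share no 5-mer and are never compared.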
de Bruijn graphs: developed outside of DNA-related work; best solution for very short reads (≤100 nt). Example: the read GACCTACA gives the k-mers (K=3) GAC, ACC, CCT, CTA, TAC, ACA; consecutive k-mers overlap by K-1 bases
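The slide's example can be reproduced with a short sketch. It uses the formulation shown on the slide, in which k-mers are the nodes and an edge joins two k-mers that are adjacent in a read (and therefore overlap by K-1 bases). The function name `de_bruijn` is hypothetical, and real assemblers add reverse complements, coverage counts and graph simplification on top of this.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the k-mers that directly follow it in some read."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors = graph[kmer]  # creates the node even without outgoing edges
            if i + k < len(read):
                successors.add(read[i + 1:i + k + 1])
    return dict(graph)


graph = de_bruijn(["GACCTACA"], k=3)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
# GAC -> ['ACC']
# ACC -> ['CCT']
# CCT -> ['CTA']
# CTA -> ['TAC']
# TAC -> ['ACA']
# ACA -> []
```

No read-against-read alignment is needed to build this graph, which is one reason the approach scales to very large numbers of short reads.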
Graphs: Schatz M C et al. Genome Res. 2010;20:1165-1173
Graphs: simplify the graph; add scaffolding information
Sequence data: sequencing errors add complexity to the graph and create new k-mers. Correction of errors: k-mer frequency (Kelley et al. Genome Biology 2010 11:R116)
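Why k-mer frequency helps with error correction can be shown with a toy count. The sketch below is not the method of Kelley et al.; it only applies a fixed count threshold, which is an assumption made for illustration, and `count_kmers` and `weak_kmers` are hypothetical names. One substitution in one read creates several k-mers that occur nowhere else, and those stand out by their low count.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(reads: list[str], k: int, min_count: int) -> set[str]:
    """K-mers seen fewer than `min_count` times: likely to contain an error."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}


# Five correct copies of a read plus one copy with a single substitution (A -> T).
reads = ["GACCTACA"] * 5 + ["GACCTTCA"]
print(sorted(weak_kmers(reads, k=3, min_count=2)))  # ['CTT', 'TCA', 'TTC']
```

A single wrong base produced three new 3-mers here, which is how errors add nodes and branches to a de Bruijn graph.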
How to sequence a genome: human, 1990s; cod 1, 2009-2011; cod 2, 2011-2012
Human genome, public effort: BAC-by-BAC sequencing (hierarchical shotgun sequencing). Figure labels: genome; BACs; select BACs; 100-150 kb; shotgun sequencing. http://www.cbcb.umd.edu/research/assembly_primer.shtml
Human genome, Celera: shotgun sequencing (entire genome shotgun); use of mate pairs
How to sequence a genome: preparations; BAC-by-BAC; add shotgun and mate pairs
The cod genome project: preparations. * From a different individual
Cod: strategy. '454 only'; NO subcloning; pure 'shotgun' approach; 454-specific paired end libraries. Supplementary: BAC ends using Sanger sequencing
Cod: sequencing
Cod: assembly. Input for assembly: 84 million reads; 28 billion bases (Gb); 34x coverage. Assembly programs: Newbler from 454; Celera from Venter Inst. Computing nodes: 24 CPUs; 128 GB of memory
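The three input figures are linked by simple arithmetic: coverage is total sequenced bases divided by genome size. The slides do not state a genome size, so the value computed below is only what the quoted numbers imply, not a figure from the talk.

```python
reads = 84_000_000        # from the slide: 84 million reads
bases = 28_000_000_000    # from the slide: 28 billion bases (Gb)
coverage = 34             # from the slide: 34x

mean_read_length = bases / reads        # average bases per read
implied_genome_size = bases / coverage  # genome size consistent with 34x

print(f"mean read length: {mean_read_length:.0f} bases")          # 333 bases
print(f"implied genome size: {implied_genome_size / 1e6:.0f} Mb")  # 824 Mb
```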
Cod: assembly. 611 Mb in 6 467 scaffolds, but 35% gap bases; short contigs; incomplete genes
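To make the 35% figure concrete, the sketch below counts gap bases in scaffold sequences and then applies the slide's numbers. It assumes gaps are written as runs of `N`, the usual convention in scaffold FASTA files; the function name `gap_fraction` and the toy sequences are made up for illustration.

```python
def gap_fraction(scaffolds: list[str]) -> float:
    """Fraction of scaffold bases that are gap characters (N or n)."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return gaps / total if total else 0.0


toy_scaffolds = ["ACGTNNNNACGT", "ACGTACGT"]  # made-up sequences
print(gap_fraction(toy_scaffolds))            # 0.2

# Applying the slide's figures: 35% of 611 Mb of scaffold is gap.
scaffold_mb = 611
gap_mb = 0.35 * scaffold_mb
print(f"{gap_mb:.0f} Mb gap, {scaffold_mb - gap_mb:.0f} Mb sequence")  # 214 Mb gap, 397 Mb sequence
```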
Cod: gaps. Heterozygosity (figure labels: contig 1, polymorphic contig 2, polymorphic contig 3, contig 4). Short tandem repeats: long (AC)n runs such as ACACACACACACACACACACACACACACAC
Cod: annotation. Ensembl; 'repair' genes based on stickleback sequence; ~22 000 genes. http://pre.ensembl.org/Gadus_morhua/
How to sequence a large eukaryotic genome
Cod 2: 2011-2012. Close the gaps: increase contig size. Pseudochromosomes: genetic linkage map; scaffolds to 'chromosomes'; anchoring; ordering and orienting
Cod 2: strategy. New data: Illumina reads; longer 454 reads (~700 bases); PacBio reads? Improved programs: newbler. New programs: assembly; gap closing
Many programs to choose from
Assembly competitions: Assemblathon 1, simulated datasets. ALLPATHS-LG, Broad Institute MIT (US); SOAPdenovo, BGI (China); SGA, Sanger Institute (UK)
Assembly competitions: Assemblathon 2, real datasets. Snake: Illumina only; cichlid fish: Illumina only; parrot: Illumina, 454 FLX+, PacBio. http://assemblathon.org/
How to sequence a genome in 2011. Cheap alternative: RAD-tag sequencing
How to sequence a genome: foundation of Illumina data: 100x coverage of paired end reads (2x100 bp); several mate pair libraries (2 kb, 3 kb, 8 kb, 10 kb, bigger?); this is now very cheap! Fill gaps with long reads: 454 or PacBio
How to sequence a genome: add lots of bioinformatics... http://cores.montana.edu/index.php?page=bioinformatics-core-facility
Thank you! lex.nederbragt@bio.uio.no, www.sequencing.uio.no


Editor's Notes

  • #17: Greedy assemblers. The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. An example is shown in Figure 8, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
  • #25: BAC-by-BAC approach. The long lines represent individual BACs. The minimal tiling path is represented by thick lines. Each BAC in the tiling path is then sequenced through the shotgun method.