De novo assembly and
genotyping of variants using
colored de Bruijn graphs
Iqbal et al. 2012
Kolmogorov Mikhail 2013
Challenges
• Detecting genetic variants that are highly
divergent from a reference
• Detecting genetic variants between samples
when there is no available reference sequence
Limitations of mapping-based approaches
● A sample may contain sequence that is absent from or divergent from the reference
● Reference sequences, particularly those of higher eukaryotes, are incomplete, notably in telomeric and pericentromeric regions
● Some samples may have no available reference
sequence
Cortex!
● De novo assembler capable of assembling
multiple eukaryote genomes simultaneously
● Focus on detecting and characterising genetic
variation
● Alignment free
● Accuracy of variant calling may be improved by
adding reference sequence
Colored graph
● De Bruijn graph with colored nodes and edges
● Different colors may reflect HTS data from
multiple samples, experiments, reference
sequences, known variant sequences
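The idea of coloring can be illustrated with a minimal sketch. It assumes a hypothetical input layout (a dict from color name to a list of sequences), ignores reverse complements, and leaves edges implicit in the (k-1)-overlaps between kmers; it is not how Cortex itself stores the graph.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every kmer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_graph(samples, k):
    """Map each kmer to a list of per-color coverage counts.

    samples: dict color_name -> list of sequences (hypothetical layout).
    """
    colors = sorted(samples)
    index = {color: i for i, color in enumerate(colors)}
    graph = defaultdict(lambda: [0] * len(colors))
    for color, seqs in samples.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                graph[kmer][index[color]] += 1
    return colors, dict(graph)


# Toy data: one reference color and one sample color carrying a SNP.
samples = {
    "reference": ["ACGTACGA"],
    "sample1": ["ACGTACGA", "ACGTTCGA"],
}
colors, graph = build_colored_graph(samples, k=4)
```

A kmer seen only in `sample1` (here `CGTT`) has zero coverage in the reference color, which is the signal the calling approaches build on.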
Variant calling approaches
● Bubble calling
● Path divergence
● Multiple-sample analysis
● Genotyping
First approach: Bubble calling
● Bubbles can be induced by variants, repeats or
sequencing errors
● Variants can be separated from repeats by
including the reference sequence
● When multiple samples are available, it is possible to
estimate the likelihood of each bubble type
BC implementation
get_super_node(n, G):
    path1 = traverse forward from n, until reaching a node with in/out degree != 1
    path2 = traverse backward from n, until reaching a node with in/out degree != 1
    return union(path1, path2)

bubble_caller(G):
    for each node n in G:
        if outdegree(n) == 2 and not_visited(n):
            mark_as_visited(n)
            (<n1,e1>, <n2,e2>) = get_outedges(n, G)
            path1 = get_super_node(n1, G)
            path2 = get_super_node(n2, G)
            mark_as_visited(path1[0] .. path1[-1])
            mark_as_visited(path2[0] .. path2[-1])
            if path1[-1] == path2[-1] and
               orientation(path1[-1]) == orientation(path2[-1]):
                bubble_found = true
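The pseudocode above can be tried out with a small runnable sketch. It makes several simplifying assumptions: the graph is a plain dict from node to successor list, nodes are arbitrary labels rather than kmers, orientation and visited-marking are left out, and the helper names (`in_degrees`, `walk_branch`, `call_bubbles`) are hypothetical.

```python
from collections import Counter


def in_degrees(graph):
    """Count incoming edges for every node of a dict-of-successors graph."""
    indeg = Counter()
    for outs in graph.values():
        for node in outs:
            indeg[node] += 1
    return indeg


def walk_branch(graph, indeg, start):
    """Walk forward from start; the path ends at the first node whose
    in-degree or out-degree is not 1 (or when a cycle is detected)."""
    path = [start]
    while indeg[path[-1]] == 1 and len(graph.get(path[-1], [])) == 1:
        nxt = graph[path[-1]][0]
        if nxt in path:  # guard against cycles
            break
        path.append(nxt)
    return path


def call_bubbles(graph):
    """Return (branch_node, path1, path2) for every simple bubble."""
    indeg = in_degrees(graph)
    bubbles = []
    for node, outs in graph.items():
        if len(outs) != 2:
            continue
        path1 = walk_branch(graph, indeg, outs[0])
        path2 = walk_branch(graph, indeg, outs[1])
        if path1[-1] == path2[-1]:  # both branches rejoin at the same node
            bubbles.append((node, path1, path2))
    return bubbles


# S branches into two paths that rejoin at E: a bubble.
bubble_graph = {"S": ["A1", "B1"], "A1": ["A2"], "A2": ["E"],
                "B1": ["E"], "E": ["T"], "T": []}
# S branches into two paths that never rejoin: no bubble.
fork_graph = {"S": ["A", "B"], "A": ["X"], "B": ["Y"], "X": [], "Y": []}

found = call_bubbles(bubble_graph)
not_found = call_bubbles(fork_graph)
```

In a real colored graph the two branches would then be compared per color, for example to check whether one branch is carried by the reference color.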
Second approach: Path divergence
● Complex variants are unlikely to generate clean
bubbles
● In some cases (particularly deletions), path
complexity is restricted to the reference allele
● Such cases may be identified by following the reference
sequence and tracking the places where it diverges from
the sample graph
PD implementation
path_divergence(L, M, ref):
    for i in [0 .. len(ref)]:
        n = ref[i]    # node at reference position i (assumed; n is not defined on the slide)
        s = get_supernode_in_sample_color(n)    # may be null
        b = get_first_position_differ(ref, s)
        if s != null and b > 0 and
           s[b-L-1+j] == ref[i+j] for j = [0 .. L-1]:
            for j = [i+L+1 .. i+M]:
                if s[len(s)-L+k] == ref[j+k] for k = [0 .. L-1]:
                    variant_found = true
                    break
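A string-level sketch may make the flank logic easier to follow. It is a simplification under stated assumptions: the sample supernode is given directly as a sequence anchored at a known reference position, `min_flank` and `max_var_len` stand in for L and M (reading them as minimum flank length and maximum variant length is an assumption), and the function names are hypothetical.

```python
def first_difference(a, b):
    """Index of the first mismatch between a and b (shorter length if none)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def path_divergence(ref, supernode, start, min_flank, max_var_len):
    """Check whether a sample supernode anchored at ref[start] leaves the
    reference and re-anchors further along it.

    Returns (breakpoint, anchor): the reference position of the first
    mismatch, and the reference position where the last min_flank bases
    of the supernode match again. Returns None if no such pair exists.
    """
    b = first_difference(ref[start:], supernode)
    # Require min_flank matching bases before the breakpoint and room
    # for a min_flank-long tail after it.
    if b < min_flank or len(supernode) - b < min_flank:
        return None
    tail = supernode[-min_flank:]
    last = min(start + b + max_var_len, len(ref) - min_flank)
    for j in range(start + b, last + 1):
        if ref[j:j + min_flank] == tail:
            return start + b, j
    return None


# Toy reference: flank GATTACA, then CCCGGG, then flank TTGACCA.
ref = "GATTACACCCGGGTTGACCA"
deleted = "GATTACATTGACCA"   # sample supernode lacking CCCGGG
same = ref[:14]              # sample supernode identical to the reference

hit = path_divergence(ref, deleted, 0, min_flank=4, max_var_len=10)
miss = path_divergence(ref, same, 0, min_flank=4, max_var_len=10)
```

Here the sample path is simple while the reference path carries the extra sequence, which is the deletion case the slide describes.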
Third approach: Multiple-sample analysis
● Joint analysis of HTS data from multiple samples can substantially improve the accuracy and reduce the false discovery rate of variant detection
● Maintaining separate colors for each sample gives additional information about whether a bubble is likely to be induced by repeats or errors
● The approach can be used when there is no suitable reference
Classification of graph structures
● Given a colored de Bruijn graph, with each color
representing a single diploid individual from a single
population
● Need to classify a pair of paths (for example,
bubble branches) as alleles of a variant,
repeats or errors
Information sources
● Total coverage of two sides
● Average allele-balance (the distribution of
coverages between the two branches)
● Variance of allele-balance across samples
Structure differences
● Repeat: normal coverage, intermediate balance, no variance between samples
● Variant: normal coverage, but each individual has 0, 1 or 2 copies of allele 1 (and 2, 1 or 0 copies of allele 2)
● Error: low coverage on the error branch
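The three information sources can be computed per site with a short sketch. It assumes a hypothetical input of one `(branch1_coverage, branch2_coverage)` pair per sample and uses the population variance; the toy numbers are invented for illustration only.

```python
from statistics import mean, pvariance


def summarize_site(coverages):
    """Summarize a pair of paths across samples.

    coverages: list of (branch1_cov, branch2_cov), one pair per sample.
    Returns (total coverage, mean allele balance, variance of balance),
    where balance is the fraction of a sample's coverage on branch 1.
    """
    total = sum(a + b for a, b in coverages)
    balances = [a / (a + b) for a, b in coverages if a + b > 0]
    return total, mean(balances), pvariance(balances)


# Repeat-like: every sample shows the same intermediate balance.
repeat_like = summarize_site([(10, 10), (10, 10), (10, 10)])
# Variant-like: samples carry 2, 1 or 0 copies of allele 1.
variant_like = summarize_site([(20, 0), (10, 10), (0, 20)])
# Error-like: one branch has almost no coverage in any sample.
error_like = summarize_site([(20, 1), (19, 0), (21, 1)])
```

Both the repeat-like and the variant-like site have a mean balance of 0.5; only the variance across samples separates them, which is why per-sample colors matter.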
Models
● Coverage is modelled as an over-dispersed Poisson distribution
● Variant: binomial distribution for allele balance
● Repeat / Error: beta distribution for allele balance
● A site is classified as Variant / Repeat / Error if the log-likelihood of the data under that model is at least 10 greater than under each of the other models (log10 Bayes factor of at least 10)
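The decision rule in the last bullet can be sketched on its own, leaving the three likelihood models aside. The function below assumes the per-model log-likelihoods have already been computed elsewhere; the `"unclassified"` label for sites that do not clear the threshold is an assumption, since the slide does not say what happens to them.

```python
def classify_site(loglik, threshold=10.0):
    """Pick the model whose log-likelihood beats every other by >= threshold.

    loglik: dict model_name -> log-likelihood of the site's data.
    Returns the winning model name, or "unclassified" (assumed label).
    """
    best = max(loglik, key=loglik.get)
    margin = min(loglik[best] - value
                 for model, value in loglik.items() if model != best)
    return best if margin >= threshold else "unclassified"


# Invented log-likelihoods, for illustration only.
clear_call = classify_site({"variant": -120.0, "repeat": -135.5, "error": -160.0})
ambiguous = classify_site({"variant": -120.0, "repeat": -125.0, "error": -160.0})
```

Raising `threshold` trades sensitivity for a lower rate of misclassified sites.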
Fourth approach: Genotyping
● Colored de Bruijn graphs can be used to genotype samples at known loci even when coverage is insufficient to enable variant assembly
● The graph is constructed from the reference sequence, known allelic variants and data from the sample
● The likelihood of each possible genotype is calculated, accounting for the graph structure
Genotyping algorithm
● We have a graph with one color for each known allele, one color for the reference genome (with the site in question named X) and one color for the sample
● For any pair of alleles $\gamma_1$, $\gamma_2$ we define a likelihood function
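Given such a likelihood function, genotyping amounts to scoring every unordered pair of known alleles and keeping the best one. The sketch below assumes the function is supplied by the caller as `loglik(allele1, allele2)`; the allele names and scores are invented for illustration.

```python
from itertools import combinations_with_replacement


def best_genotype(alleles, loglik):
    """Score every unordered allele pair and return the most likely one.

    alleles: list of allele names.
    loglik: callable (allele1, allele2) -> log-likelihood (caller-supplied).
    """
    scored = {pair: loglik(*pair)
              for pair in combinations_with_replacement(alleles, 2)}
    best = max(scored, key=scored.get)
    return best, scored


# Toy scores standing in for a real likelihood function.
toy_scores = {("ref", "ref"): -30.0, ("ref", "alt"): -12.0, ("alt", "alt"): -25.0}
genotype, scores = best_genotype(["ref", "alt"], lambda a, b: toy_scores[(a, b)])
```

Pairs with the same allele twice are homozygous genotypes, so two known alleles give three candidates.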
Likelihood function
● Ignore all segments of the alleles that are present in the reference minus X
● Decompose both alleles into sections that are shared ($s_i$) and unique ($u_i$), and count the number of reads in each
● Count the number of nodes $n$ present in both the sample graph and the joint graph of all known alleles except $\gamma_1$ and $\gamma_2$; these are treated as sequencing errors
● For each section of length $l_i$, the probability of $r_i$ reads arriving within the section is given by a Poisson distribution with rate $\theta_i = \lambda l_i$ on shared segments and $\theta_i / 2$ on unique segments
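The Poisson part of this likelihood can be written down directly. The sketch below covers only the read-count term over sections; the contribution of the n error nodes is left out, and the tuple layout `(length, reads, shared)` and the function names are hypothetical.

```python
import math


def poisson_logpmf(r, theta):
    """Natural log of the Poisson probability of r events at rate theta."""
    return r * math.log(theta) - theta - math.lgamma(r + 1)


def genotype_loglik(sections, lam):
    """Sum the Poisson log-probabilities over all sections of two alleles.

    sections: list of (length, reads, shared) tuples.
    lam: expected number of reads per unit length on a shared section.
    Shared sections use rate lam * length, unique sections half of that.
    The sequencing-error term for unexplained nodes is NOT included.
    """
    total = 0.0
    for length, reads, shared in sections:
        theta = lam * length
        if not shared:
            theta /= 2
        total += poisson_logpmf(reads, theta)
    return total


# A unique section with half the shared read density fits the
# "unique" rate better than the "shared" rate.
as_unique = genotype_loglik([(100, 50, True), (40, 10, False)], lam=0.5)
as_shared = genotype_loglik([(100, 50, True), (40, 10, True)], lam=0.5)
```

Halving the rate on unique sections reflects that only one of the two alleles contributes reads there.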
Colored graph implementation
● Graph is encoded implicitly, within a hash table
● Kmers and their reverse complements are stored in the same node
● Hash table value is an array of integers, representing coverage for each color
● For each kmer, one binary flag per nucleotide tracks whether the corresponding edge is present
● Memory usage per node: 8⌈k / 32⌉ + 5c + 1 bytes (c = number of colors)
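A dictionary-based sketch of this encoding is shown below. It is an illustration under assumptions, not the Cortex data structure: the node layout, the 8-bit edge mask per color (bits 0-3 for A, C, G, T added on the right, bits 4-7 for bases added on the left) and the reading of k / 32 as rounded up to whole 64-bit words are all assumed, and a Python dict does not achieve the stated memory footprint.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def new_node(n_colors):
    """One coverage counter and one edge bit mask per color."""
    return {"coverage": [0] * n_colors, "edges": [0] * n_colors}


def add_edge(table, kmer, base, on_right, color):
    """Flag that kmer (as read) is extended by base on the given side."""
    key = min(kmer, revcomp(kmer))
    if key != kmer:  # stored as reverse complement: flip base and side
        base, on_right = base.translate(COMPLEMENT), not on_right
    bit = BASES.index(base) + (0 if on_right else 4)
    table[key]["edges"][color] |= 1 << bit


def add_sequence(table, seq, k, color, n_colors):
    """Insert every kmer of seq under its canonical key."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        key = min(kmer, revcomp(kmer))
        table.setdefault(key, new_node(n_colors))["coverage"][color] += 1
        if i > 0:
            add_edge(table, kmer, seq[i - 1], False, color)
        if i + k < len(seq):
            add_edge(table, kmer, seq[i + k], True, color)


def bytes_per_node(k, n_colors):
    """Per-node memory from the slide's formula, with k / 32 rounded up."""
    return 8 * math.ceil(k / 32) + 5 * n_colors + 1


table = {}
add_sequence(table, "ACGTT", k=3, color=0, n_colors=2)
```

In this toy run `ACG` and `CGT` are reverse complements of each other, so they land in the same node and its coverage for color 0 becomes 2.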
Use cases
● Variant calling in a high coverage human genome
● Detection of novel sequence from population
graphs of low-coverage samples
● Using population information to classify bubble
structures
● Genotyping simple and complex variants
Limitations
● Paired-read information is not used
● Greater need for error correction as kmer size
increases
● Potential for a graph explosion as more
individuals are included in the graph
Thanks for your attention
