Article “Filtering duplicate reads from 454 pyrosequencing data”
Susanne Balzer, Ketil Malde, Markus A. Grohme and Inge Jonassen
Reporter: Nadiya Sitdykova
11.05.2013
Susanne Balzer, Ketil Malde, Markus A. Grohme and Inge Jonassen (Reporter: Nadiya Sitdykova)Article “Filtering duplicate reads from 454 pyrosequencing data“ 11.05.2013 1 / 12
Motivation
Artificially duplicated reads are duplicate reads that do not arise naturally, i.e. by
chance through sampling DNA molecules that start at identical positions or in
repetitive regions of a genome.
They lead to incorrect conclusions about read coverage.
Sequencing applications that are extremely sensitive to artificially duplicated reads:
de novo whole-genome sequencing
metagenomics
metatranscriptomics
Sources of artificial duplicates
Emulsion PCR
A droplet containing several empty beads and a single DNA molecule results
in all of these beads being loaded with identical copies of the original DNA molecule.
Background amplicon contamination
Avoided by preventing cross-contamination of sequencing library samples.
Signal cross-talk on the PTP sequencing device
With the launch of the 454 Titanium chemistry, well cross-talk has been
minimized by metal coating of the PTP well surface.
Flow space
The native output format of 454 pyrosequencing is the binary standard flowgram
format (*.sff). It contains the flowgram for each read; each flowgram
consists of a sequence of flow values representing base incorporations.
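To make the notion of flow space concrete, the following minimal sketch turns a list of flow values into a base sequence by rounding each value to a homopolymer length. It assumes the cyclic nucleotide flow order TACG; the function name and the example values are made up for illustration and are not taken from the article.

```python
FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order


def flowgram_to_bases(flow_values, flow_order=FLOW_ORDER):
    """Round each flow value to a homopolymer length and emit that many bases."""
    bases = []
    for i, value in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        homopolymer_length = int(value + 0.5)  # round to the nearest integer
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)


# Illustrative flow values (not real data): the first eight flows spell TCAG.
example = [1.02, 0.05, 0.97, 0.03, 0.08, 1.04, 0.11, 0.93, 2.06, 0.12, 0.04, 1.01]
print(flowgram_to_bases(example))  # prints TCAGTTG
```

Two flowgrams can round to the same bases while differing in their flow values, which is the information a method working in flow space can exploit.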
Algorithm
Novelty: the method operates only in flow space.
Preclustering: create subsets of the data such that flowgrams from different subsets
cannot be identified as duplicates of each other.
Use a varying seed of at least 8 flows, starting with the first flow.
Take into account whether each flow value was ’negative’ (< 0.5) or ’positive’ (≥ 0.5).
Require flowgrams in one precluster to start with the same homopolymer
length.
For preclusters containing > 2000 flowgrams, gradually increase the seed length and split
them up.
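The preclustering rules can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the seed key is taken to be the positive/negative pattern of the first flows plus the rounded value of the first positive flow, and the names `seed_key` and `precluster` are hypothetical.

```python
from collections import defaultdict


def seed_key(flowgram, seed_len):
    """Key = positive/negative pattern of the seed + first homopolymer length."""
    pattern = tuple(value >= 0.5 for value in flowgram[:seed_len])
    # Assumption: "start with the same homopolymer length" is read as the
    # rounded value of the first positive flow.
    first_hp = next((int(v + 0.5) for v in flowgram if v >= 0.5), 0)
    return pattern, first_hp


def precluster(flowgrams, seed_len=8, max_size=2000):
    """Return lists of flowgram indices; oversized groups get a longer seed."""
    pending = [(list(range(len(flowgrams))), seed_len)]
    done = []
    while pending:
        members, seed = pending.pop()
        buckets = defaultdict(list)
        for idx in members:
            buckets[seed_key(flowgrams[idx], seed)].append(idx)
        for group in buckets.values():
            longest = max(len(flowgrams[i]) for i in group)
            if len(group) > max_size and seed < longest:
                pending.append((group, seed + 1))  # split further
            else:
                done.append(group)
    return done
```

Each returned group can then be clustered independently, which keeps the number of pairwise distance computations manageable.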
Hierarchical clustering: iterate through the files with the preclustered
flowgrams and perform agglomerative clustering on one file at a time.
Start: each cluster contains one flowgram. Calculate all pairwise distances
between flowgrams.
Merge the two clusters with the smallest distance.
Determine the consensus: calculate the per-flow median of the flow values from all
flowgrams in this cluster (quality-trimmed regions only).
Update the distances between the new cluster and all other clusters.
Stop: the smallest distance exceeds a given stringency threshold.
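The clustering loop can be sketched as below. It is a simplified illustration: the distance function is passed in as a parameter, quality trimming is ignored, all distances are recomputed in every round instead of being updated incrementally, and `toy_distance` is only a stand-in for the probabilistic distance defined on the following slides.

```python
from statistics import median


def consensus(flowgrams):
    """Per-flow median over the flows shared by all members of a cluster."""
    shared = min(len(fg) for fg in flowgrams)
    return [median(fg[i] for fg in flowgrams) for i in range(shared)]


def agglomerate(flowgrams, distance, threshold):
    """Merge the closest clusters until the smallest distance exceeds threshold."""
    clusters = [[i] for i in range(len(flowgrams))]  # start: one flowgram each
    reps = [list(fg) for fg in flowgrams]            # consensus per cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(reps[i], reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # stop criterion
        merged = clusters[i] + clusters[j]
        merged_rep = consensus([flowgrams[k] for k in merged])
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i], reps[i] = merged, merged_rep
    return clusters


def toy_distance(fg_a, fg_b):
    """Stand-in distance (mean absolute difference); not the article's measure."""
    n = min(len(fg_a), len(fg_b))
    return sum(abs(fg_a[i] - fg_b[i]) for i in range(n)) / n


reads = [[1.0, 0.0, 2.0], [1.02, 0.03, 1.98], [0.0, 1.0, 1.0]]
print(agglomerate(reads, toy_distance, threshold=0.1))  # prints [[0, 1], [2]]
```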
Distance between flowgrams
Probability for a homopolymer length being equal to h when observing a flow
value f :
$$P(h \mid f) = \frac{P(f \mid h) \cdot P(h)}{P(f)}$$
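A small numerical sketch of this Bayes step follows. The likelihood and prior used here are placeholders chosen only to make the example run (a Gaussian centred on h and a geometric-style prior); the slides do not specify the distributions used in the article. Homopolymer lengths are truncated to 0..5, so P(f) is computed as the sum over those lengths.

```python
import math

MAX_HP = 5  # homopolymer lengths 0..5 are modelled explicitly


def likelihood(f, h):
    """P(f | h): placeholder Gaussian centred on h (an assumption)."""
    sigma = 0.1 + 0.05 * h  # hypothetical spread, growing with h
    z = (f - h) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def prior(h):
    """P(h): placeholder prior that favours short homopolymers (an assumption)."""
    return 0.5 ** (h + 1)


def posterior(f):
    """Return [P(h=0 | f), ..., P(h=5 | f)] via Bayes' rule."""
    joint = [likelihood(f, h) * prior(h) for h in range(MAX_HP + 1)]
    evidence = sum(joint)  # P(f), restricted to h in 0..5
    return [p / evidence for p in joint]


print([round(p, 3) for p in posterior(1.45)])
```

The sketch is meant for flow values within the modelled range; for very large flow values the placeholder likelihoods underflow to zero.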
Distance between flowgrams
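Assume that flowgrams fg_a and fg_b are independent. Probability that the
homopolymer lengths h_{a_i} and h_{b_i} are equal, given two flow values f_{a_i} and f_{b_i}:
$$P(h_{a_i} = h_{b_i} \mid f_{a_i}, f_{b_i}) :=
\begin{cases}
1, & \text{if } f_{a_i} \text{ or } f_{b_i} > 5.5; \\
1, & \text{if } f_{a_i} \text{ and } f_{b_i} > 2.5; \\
\sum_{k=0}^{5} P(h_{a_i} = k \mid f_{a_i}) \cdot P(h_{b_i} = k \mid f_{b_i}), & \text{else.}
\end{cases}$$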
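The case distinction translates directly into code. In this sketch the per-flow posteriors are passed in as lists of six probabilities (for k = 0..5); the function name and the example numbers are illustrative only.

```python
def match_probability(f_a, f_b, post_a, post_b):
    """P(h_a = h_b | f_a, f_b) following the three cases on the slide.

    post_a[k] and post_b[k] hold P(h = k | f) for k = 0..5.
    """
    if f_a > 5.5 or f_b > 5.5:
        return 1.0  # first case: at least one very large flow value
    if f_a > 2.5 and f_b > 2.5:
        return 1.0  # second case: both flow values are large
    # Otherwise: probability that both flows have the same length k.
    return sum(pa * pb for pa, pb in zip(post_a, post_b))


post_a = [0.0, 0.9, 0.1, 0.0, 0.0, 0.0]  # made-up posteriors
post_b = [0.0, 0.8, 0.2, 0.0, 0.0, 0.0]
print(match_probability(1.0, 1.1, post_a, post_b))
```

Setting the probability to 1 in the first two cases means that such flows contribute nothing to the distance defined on the next slide.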
Distance between flowgrams
We take into account the 454 key and quality trimming information, so only
informative flow values are used.
Assume that the flow values of one flowgram are not correlated. Define the distance
between two flowgrams:
$$d(fg_a, fg_b) := \frac{-\log \prod_{i=l}^{m} P(h_{a_i} = h_{b_i} \mid f_{a_i}, f_{b_i})}{m - (l - 1)} = \frac{\sum_{i=l}^{m} -\log P(h_{a_i} = h_{b_i} \mid f_{a_i}, f_{b_i})}{m - (l - 1)}$$
where
$$l = \max\{\text{left trimpoint}(fg_a), \text{left trimpoint}(fg_b)\}, \quad m = \min\{400, \text{right trimpoint}(fg_a), \text{right trimpoint}(fg_b)\}$$
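As a sketch of how the distance could be evaluated, the function below averages the negative log match probabilities over the flows between the trim points. It assumes 1-based, inclusive trim points and a precomputed list of per-flow match probabilities; both the function name and these conventions are assumptions made for the example.

```python
import math


def flowgram_distance(match_probs, left_a, right_a, left_b, right_b, max_flow=400):
    """Average negative log match probability over flows l..m (1-based, inclusive).

    match_probs[i - 1] holds P(h_ai = h_bi | f_ai, f_bi) for flow i.
    """
    l = max(left_a, left_b)
    m = min(max_flow, right_a, right_b)
    n_flows = m - (l - 1)
    if n_flows <= 0:
        raise ValueError("trimmed regions do not overlap")
    total = 0.0
    for i in range(l, m + 1):
        p = match_probs[i - 1]
        if p == 0.0:
            return math.inf  # lengths cannot be equal at this flow
        total += -math.log(p)
    return total / n_flows


probs = [0.5] * 10  # made-up match probabilities
print(flowgram_distance(probs, left_a=3, right_a=8, left_b=1, right_b=10))
```

Dividing by the number of compared flows makes distances comparable between pairs of reads with different trimmed lengths.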
Benchmarking
Benchmarking dataset: 1,270,325 Dicentrarchus labrax 454 GS FLX Titanium
reads, mapped to the corresponding reference scaffolds.
Benchmarking
The measure used to compare two sets of clusters is the Jaccard index:
$$\text{Jaccard} := \frac{a}{a + b + c}$$
a - flowgram pairs that are correctly identified as duplicates
b - flowgram pairs that are incorrectly not identified as duplicates
c - flowgram pairs that are incorrectly identified as duplicates
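The pair counts a, b and c can be obtained by listing all within-cluster pairs of both clusterings. The sketch below assumes each clustering is given as a list of lists of read identifiers; the function names are illustrative.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """All unordered pairs of reads that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(pair) for pair in combinations(cluster, 2))
    return pairs


def jaccard_index(truth, predicted):
    """Jaccard index a / (a + b + c) over duplicate pairs."""
    true_pairs = duplicate_pairs(truth)
    pred_pairs = duplicate_pairs(predicted)
    a = len(true_pairs & pred_pairs)  # correctly identified as duplicates
    b = len(true_pairs - pred_pairs)  # duplicates that were missed
    c = len(pred_pairs - true_pairs)  # incorrectly identified as duplicates
    total = a + b + c
    return a / total if total else 1.0  # no pairs at all: clusterings agree


truth = [[1, 2, 3], [4, 5]]
predicted = [[1, 2], [3, 4, 5]]
print(jaccard_index(truth, predicted))  # a = 2, b = 2, c = 2
```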
Benchmarking
Additionally, the effect of duplicate removal on assembly performance was tested on the
E. coli genome:
Filtered reads were assembled using Newbler.
The resulting assemblies were scored using Mauve assembly metrics.
Result: for both JATAC and cd-hit-454, the N50 increased from 106,414 bp to
126,844 bp.
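For readers unfamiliar with the metric, N50 is the contig length L such that contigs of length at least L together cover at least half of the total assembly length. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

A higher N50 indicates that more of the assembly lies in long contigs.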