Slides -n._sitdykova_0

Article “Filtering duplicate reads from 454 pyrosequencing data”
Susanne Balzer, Ketil Malde, Markus A. Grohme and Inge Jonassen
Reporter: Nadiya Sitdykova
11.05.2013

Motivation
Artificially duplicated reads are duplicated reads that do not arise naturally, i.e. by
chance through sampling DNA molecules that start at identical positions or lie in
repetitive regions of a genome.
They lead to incorrect conclusions about read coverage.
Sequencing applications that are extremely sensitive to artificially duplicated reads:
de novo whole-genome sequencing
metagenomics
metatranscriptomics

Sources of artificial duplicates
Emulsion PCR
A droplet containing several empty beads and a single DNA molecule results
in these beads being loaded with identical copies of the original DNA molecule.
Background amplicon contamination
Avoided by preventing cross-contamination of sequencing library samples.
Signal cross-talk on the PTP sequencing device
With the launch of the 454 Titanium chemistry, well cross-talk has been
minimized by metal coating of the PTP well surface.

Flow space
The native output format of 454 pyrosequencing is the binary standard flowgram
format (*.sff). It contains the flowgram for each read; each flowgram
consists of a sequence of flow values representing base incorporations.
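
As a rough illustration of what a flowgram encodes, the sketch below turns a list of flow values into bases. It assumes the cyclic nucleotide flow order TACG and simple rounding of each flow value to the nearest homopolymer length; the helper name `flowgram_to_bases` and the example values are hypothetical and not taken from a real .sff file.

```python
# Assumption: cyclic flow order "TACG"; each flow value is rounded to the
# nearest integer to obtain the homopolymer length for that flow.
FLOW_ORDER = "TACG"

def flowgram_to_bases(flow_values, flow_order=FLOW_ORDER):
    """Translate a list of flow values into a base sequence."""
    bases = []
    for i, value in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        homopolymer_length = int(value + 0.5)  # round half up
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)

# Toy flowgram with made-up values.
example = [1.02, 0.05, 0.97, 1.10, 0.03, 2.08, 0.10, 0.96]
print(flowgram_to_bases(example))
```

A flow value near 0 means no base was incorporated in that flow, a value near 1 means one base, a value near 2 a homopolymer of length two, and so on.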

Algorithm
Novelty: operates only in flow space.
Preclustering: create subsets of the data such that flowgrams from different subsets
cannot be identified as duplicates of each other.
Use a varying seed of at least 8 flows, starting with the first flow
Take into account whether each flow value is 'negative' (< 0.5) or 'positive' (≥ 0.5)
Require flowgrams in one precluster to start with the same homopolymer
length
For preclusters containing >2000 flowgrams gradually increase seed and split
them up
Hierarchical clustering: iterate through the files with the preclustered
flowgrams and perform agglomerative clustering on one file at a time.
Start: each cluster contains one flowgram. Calculate all pairwise distances
between flowgrams.
Merge the two clusters with the smallest distance
Determine consensus: calculate the per-flow median of flow values from all
flowgrams in this cluster (quality-trimmed regions only).
Update the distances between the new cluster and all other clusters
Stop: the smallest distance exceeds a given stringency threshold
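
The two stages above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the preclustering key uses only the positive/negative pattern of a fixed seed (the homopolymer-length requirement and the splitting of large preclusters are omitted), quality trimming is ignored, distances are recomputed in every round instead of being updated, and `mean_abs_difference` is a hypothetical stand-in for the probabilistic distance. All function names are assumptions.

```python
from collections import defaultdict
from statistics import median

def precluster(flowgrams, seed_length=8):
    """Group flowgrams by the positive/negative pattern of their first flows."""
    groups = defaultdict(list)
    for fg in flowgrams:
        # A flow counts as 'positive' if its value is >= 0.5, else 'negative'.
        key = tuple(value >= 0.5 for value in fg[:seed_length])
        groups[key].append(fg)
    return list(groups.values())

def consensus(cluster):
    """Per-flow median over all flowgrams of a cluster."""
    return [median(values) for values in zip(*cluster)]

def mean_abs_difference(fg_a, fg_b):
    """Stand-in distance (assumption), not the probabilistic distance."""
    n = min(len(fg_a), len(fg_b))
    return sum(abs(x - y) for x, y in zip(fg_a, fg_b)) / n

def agglomerate(flowgrams, threshold, distance=mean_abs_difference):
    """Merge the closest pair of clusters until the threshold is exceeded."""
    clusters = [[fg] for fg in flowgrams]  # start: one flowgram per cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(consensus(clusters[i]), consensus(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # stop: smallest distance exceeds the threshold
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data with made-up flow values.
reads = [
    [1.00, 0.02, 2.01, 0.00],
    [1.03, 0.05, 1.97, 0.04],
    [0.03, 1.01, 0.98, 1.02],
]
for subset in precluster(reads, seed_length=4):
    print(agglomerate(subset, threshold=0.1))
```

Preclustering keeps the quadratic pairwise comparison confined to small subsets, which is why it runs before the hierarchical step.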

Distance between flowgrams
Probability for a homopolymer length being equal to h when observing a flow
value f:
$$P(h \mid f) = \frac{P(f \mid h) \cdot P(h)}{P(f)}$$
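
A minimal numerical sketch of this Bayes step is shown below. The slides do not state the form of P(f|h) or P(h), so the sketch assumes a normal likelihood centred on h with a hypothetical standard deviation `sd`, a uniform prior by default, and homopolymer lengths 0 to 5; all names are illustrative.

```python
import math

MAX_H = 5  # homopolymer lengths 0..5 are considered in this sketch

def normal_pdf(x, mean, sd):
    """Density of a normal distribution (assumed likelihood model)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(f, prior=None, sd=0.15):
    """Return [P(h=0|f), ..., P(h=MAX_H|f)] via Bayes' rule.

    Intended for flow values roughly within the 0..MAX_H range; far outside
    it the assumed likelihoods underflow to zero.
    """
    if prior is None:
        prior = [1.0 / (MAX_H + 1)] * (MAX_H + 1)  # uniform P(h), an assumption
    joint = [normal_pdf(f, h, sd) * prior[h] for h in range(MAX_H + 1)]
    evidence = sum(joint)  # P(f), the normalising constant
    return [j / evidence for j in joint]

probabilities = posterior(1.02)
print(probabilities.index(max(probabilities)))  # most probable homopolymer length
```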

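Assume that flowgrams $fg_a$ and $fg_b$ are independent. Probability that the
homopolymer lengths, $h_{ai}$ and $h_{bi}$, are equal, given two flow values, $f_{ai}$ and $f_{bi}$:
$$P(h_{ai} = h_{bi} \mid f_{ai}, f_{bi}) := \begin{cases} 1, & \text{if } f_{ai} \text{ or } f_{bi} > 5.5; \\ 1, & \text{if } f_{ai} \text{ and } f_{bi} > 2.5; \\ \sum_{k=0}^{5} P(h_{ai} = k \mid f_{ai}) \cdot P(h_{bi} = k \mid f_{bi}), & \text{else.} \end{cases}$$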
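
The case distinction can be written out directly, as in the sketch below. The thresholds 5.5 and 2.5 and the sum over k = 0..5 come from the definition above; `toy_posterior` is a hypothetical placeholder for P(h = k | f) (normal likelihood, uniform prior) and is only an assumption.

```python
import math

def toy_posterior(f, max_h=5, sd=0.15):
    """Hypothetical P(h=k|f): normal likelihood around k, uniform prior."""
    weights = [math.exp(-0.5 * ((f - k) / sd) ** 2) for k in range(max_h + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def match_probability(f_a, f_b, posterior=toy_posterior):
    """P(h_ai = h_bi | f_ai, f_bi) following the three cases above."""
    if f_a > 5.5 or f_b > 5.5:
        return 1.0  # first case: either flow value above 5.5
    if f_a > 2.5 and f_b > 2.5:
        return 1.0  # second case: both flow values above 2.5
    p_a, p_b = posterior(f_a), posterior(f_b)
    # Independence of the two flowgrams lets us multiply per-length posteriors.
    return sum(pa * pb for pa, pb in zip(p_a, p_b))

print(match_probability(1.02, 0.97))  # both flows suggest length 1
print(match_probability(1.02, 2.05))  # flows suggest different lengths
```

Under this definition, high flow values are treated as uninformative for telling two reads apart: they contribute a probability of 1 and therefore nothing to the distance.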

We take into account the 454 key and quality trimming information, so only
informative flow values are used.
Assume that the flow values of one flowgram are not correlated. Define the distance
between two flowgrams:
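$$d(fg_a, fg_b) := \frac{-\log \prod_{i=l}^{m} P(h_{ai} = h_{bi} \mid f_{ai}, f_{bi})}{m - (l - 1)} = \sum_{i=l}^{m} \frac{-\log P(h_{ai} = h_{bi} \mid f_{ai}, f_{bi})}{m - (l - 1)}$$
where
$$l = \max\{\text{left trimpoint}(fg_a), \text{left trimpoint}(fg_b)\}, \quad m = \min\{400, \text{right trimpoint}(fg_a), \text{right trimpoint}(fg_b)\}$$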
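
Given per-flow match probabilities, the distance is the average negative log probability over the shared informative flows. The sketch below assumes 1-based flow indices, natural logarithms, and strictly positive probabilities; the function and parameter names are hypothetical.

```python
import math

def flowgram_distance(match_probs, left_a, right_a, left_b, right_b, max_flow=400):
    """Average negative log match probability over flows l..m.

    match_probs[i - 1] holds P(h_ai = h_bi | f_ai, f_bi) for the 1-based
    flow index i (an indexing assumption). Probabilities must be > 0.
    """
    l = max(left_a, left_b)             # later of the two left trim points
    m = min(max_flow, right_a, right_b) # earlier right trim point, capped
    n_flows = m - (l - 1)
    if n_flows <= 0:
        raise ValueError("no overlapping informative flows")
    total = sum(-math.log(match_probs[i - 1]) for i in range(l, m + 1))
    return total / n_flows

# Toy example: ten flows, each with match probability 0.5.
probs = [0.5] * 10
print(flowgram_distance(probs, left_a=3, right_a=9, left_b=2, right_b=8))
```

Dividing by the number of compared flows keeps the distance comparable between flowgram pairs with different trimmed lengths.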

Benchmarking
Benchmarking dataset: 1270325 Dicentrarchus labrax 454 GS FLX Titanium
reads, mapped to the corresponding reference scaffolds.

Benchmarking
The measure used to compare two sets of clusters is the Jaccard index:
$$\text{Jaccard} := \frac{a}{a + b + c}$$
a - number of flowgram pairs that are correctly identified as duplicates
b - number of flowgram pairs that are incorrectly not identified as duplicates
c - number of flowgram pairs that are incorrectly identified as duplicates
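
A small sketch of this pair-counting comparison follows. Clusters are represented as lists of read identifiers; the helper names, the read names, and the convention of returning 1.0 when neither clustering contains any duplicate pair are assumptions made for illustration.

```python
from itertools import combinations

def duplicate_pairs(clusters):
    """All unordered pairs of reads that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for x, y in combinations(sorted(cluster), 2):
            pairs.add((x, y))
    return pairs

def jaccard_index(true_clusters, predicted_clusters):
    """Jaccard index a / (a + b + c) over duplicate pairs."""
    truth = duplicate_pairs(true_clusters)
    predicted = duplicate_pairs(predicted_clusters)
    a = len(truth & predicted)  # correctly identified duplicate pairs
    b = len(truth - predicted)  # duplicate pairs that were missed
    c = len(predicted - truth)  # pairs wrongly called duplicates
    if a + b + c == 0:
        return 1.0  # assumed convention: no pairs in either clustering
    return a / (a + b + c)

# Toy example with hypothetical read names.
truth = [["r1", "r2", "r3"], ["r4"]]
predicted = [["r1", "r2"], ["r3", "r4"]]
print(jaccard_index(truth, predicted))
```

In the toy example, one pair is shared, two true pairs are missed, and one pair is wrongly reported, so the index is 1 / (1 + 2 + 1).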

Benchmarking
Additionally tested the effect of duplicate removal on assembly performance for the
E. coli genome:
Filtered reads were assembled using Newbler.
The resulting assemblies were scored using Mauve assembly metrics.
Result: for both JATAC and cd-hit-454, the N50 increased from 106414 bp to
126844 bp.
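
For readers unfamiliar with the N50 statistic reported here, the sketch below shows the usual definition: the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the contig lengths are illustrative and unrelated to the E. coli assemblies.

```python
def n50(contig_lengths):
    """Contig length at which the cumulative sum reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Toy assembly with made-up contig lengths (total 290, half 145).
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why it serves as the assembly quality indicator here.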
