C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
May 1, 2013
ctb@msu.edu
Streaming approaches to reference-free variant
calling
Open, online science
Much of the software and approaches I’m talking
about today are available:
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
Outline & Overview
• Motivation: lots of data, analyzed with “offline” approaches.
• Reference-based vs. reference-free approaches.
• Single-pass algorithms for lossy compression; application to resequencing data.
Shotgun sequencing
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
Sequencers produce errors
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
Three basic problems
Resequencing, counting, and assembly; in this talk, resequencing & counting are grouped together.
Resequencing analysis
We know a reference genome and want to find variants (blue) in a background of errors (red).
Counting
We have a reference genome (or gene set) and want to know how much of each we have. Think gene expression/microarrays, copy number variation.
Noisy observations <->
information
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
“Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than Moore’s
Law.
2. Your data gathering rate matches Moore’s Law.
3. Your data gathering rate exceeds Moore’s Law.
[Figure: sequencing costs over time; source: http://www.genome.gov/sequencingcosts/]
“Three types of data scientists.”
1. Your data gathering rate is slower than Moore’s
Law.
=> Be lazy, all will work out.
2. Your data gathering rate matches Moore’s Law.
=> You need to write good software, but all will
work out.
3. Your data gathering rate exceeds Moore’s Law.
=> You need serious help.
Random sampling => deep sampling
needed
Typically 10-100x coverage is needed for robust recovery (~300 Gbp for a human genome at 100x).
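As a quick back-of-the-envelope check of that figure, assuming a ~3 Gbp human genome:

    # Back-of-the-envelope: deep random sampling of a human genome.
    genome_size_bp = 3e9        # approximate human genome size (assumption)
    fold_coverage = 100         # upper end of the 10-100x range above
    print("%.0f Gbp" % (fold_coverage * genome_size_bp / 1e9))  # -> 300 Gbp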
Applications in cancer genomics
• Single-cell cancer genomics will advance:
  • e.g. ~60-300 Gbp of data for each of ~1000 tumor cells;
  • infer the phylogeny of the tumor => mechanistic insight.
• Current approaches are computationally intensive and data-heavy.
Current variant calling approach.
Map reads to reference → “pileup” and variant calling → downstream diagnostics.
Drawbacks of reference-based approaches
• Fairly narrowly defined heuristics.
• Allelic mapping bias: mapping is biased towards the reference allele.
• Ignorant of “unexpected” novelty:
  • indels, especially large indels, are often ignored;
  • structural variation is not easily retained or recovered;
  • true novelty is discarded.
• Most implementations are multi-pass over big data.
Challenges
• Considerable amounts of noise in the data (0.1-1% error).
• Reference-based approaches have several drawbacks:
  • dependent on the quality/applicability of the reference;
  • detection of true novelty (SNPs vs. indels; SVs) is problematic.
• => The first major data reduction step (variant calling) is extremely lossy in terms of potential information.
[Diagram: raw data (~10-100 GB) → analysis → multiple “information” products (~1 GB) → database & integration; compression (~2 GB).]
A software & algorithms approach: can we develop
lossy compression approaches that
1. Reduce data size & remove errors => efficient
processing?
2. Retain all “information”? (think JPEG)
If so, then we can store only the compressed data for
later reanalysis.
Short answer is: yes, we can.
[Same diagram, annotated: “Save in cold storage”; “Save for reanalysis, investigation.”]
My lab at MSU:
Theoretical => applied solutions.
Theoretical advances in data structures and algorithms → practically useful & usable implementations, at scale → demonstrated effectiveness on real data.
1. Time- and space-efficient k-mer counting
To add an element: increment the associated counter at all hash locales.
To get a count: retrieve the minimum counter across all hash locales.
Source: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
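This add/count pair is the count-min sketch pattern. A minimal Python sketch of the idea follows; khmer's actual counting structure is implemented in C++ and differs in detail (hash functions, table sizing):

    # Minimal count-min sketch for k-mer counting (illustrative only).
    import hashlib

    class CountMinSketch:
        def __init__(self, num_tables=4, table_size=1000003):
            self.table_size = table_size
            self.tables = [[0] * table_size for _ in range(num_tables)]

        def _locales(self, kmer):
            # One hash locale per table, derived from a salted digest.
            for i, table in enumerate(self.tables):
                digest = hashlib.sha1(("%d:%s" % (i, kmer)).encode()).hexdigest()
                yield table, int(digest, 16) % self.table_size

        def add(self, kmer):
            # To add an element: increment the counter at every hash locale.
            for table, loc in self._locales(kmer):
                table[loc] += 1

        def count(self, kmer):
            # To get a count: take the minimum across all hash locales.
            # Collisions only inflate counters, so the error is one-sided.
            return min(table[loc] for table, loc in self._locales(kmer))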
2. Compressible assembly graphs (NOVEL)
[Figure: results at 1%, 5%, 10%, and 15%; Pell et al., PNAS, 2012.]
• Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
• The core algorithm is single-pass and “low” memory.
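To sketch the idea behind these compressible graphs (following Pell et al.): store the k-mers in an approximate membership structure and recover graph edges by querying all possible neighbors. A minimal illustration, reusing the CountMinSketch above as the membership structure; the real implementation uses Bloom filters in C++:

    # Implicit de Bruijn graph: a node's neighbors are the k-mers that
    # overlap it by k-1 bases AND are present in the (approximate) k-mer set.
    def neighbors(kmer, is_present):
        for base in "ACGT":
            for candidate in (kmer[1:] + base, base + kmer[:-1]):
                if is_present(candidate):
                    yield candidate

    # Usage sketch: treat any k-mer with a nonzero estimated count as present.
    # cms = CountMinSketch(); ...
    # list(neighbors("ATGGCA", lambda k: cms.count(k) > 0))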
3. Online, streaming, lossy
compression.
(NOVEL)
Brown et al., arXiv, 2012
Digital normalization
[Series of figure-only slides stepping through the digital normalization process.]
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• is reference-free;
• is single-pass: looks at each read only once;
• does not “collect” the majority of errors;
• keeps all low-coverage reads & retains all their information;
• smooths out coverage across regions.
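A minimal sketch of the diginorm loop in Python, reusing the CountMinSketch above and assuming reads longer than the k-mer size; khmer's normalize-by-median.py is the real implementation:

    def normalize_by_median(reads, ksize=20, cutoff=20):
        counts = CountMinSketch()
        for read in reads:
            kmers = [read[i:i + ksize] for i in range(len(read) - ksize + 1)]
            estimates = sorted(counts.count(k) for k in kmers)
            median = estimates[len(estimates) // 2]
            if median < cutoff:
                # Estimated coverage still low: keep the read, count its k-mers.
                for k in kmers:
                    counts.add(k)
                yield read
            # Otherwise the region is saturated; the read (and most of the
            # errors it carries) is discarded without being counted.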
Can we apply this algorithmically efficient
technique to variants? Yes.
Single-pass, reference-free, tunable, streaming, online variant calling.
Align reads to assembly graph
Dr. Jason Pell
Reference-free variant calling.
Align read to graph → novelty? Retain. → Saturated? Count & discard. → Output variant at saturation (online) → downstream diagnostics.
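A hypothetical Python sketch of this loop; graph.align and the fields on the returned hit are invented names for illustration, not khmer's actual API:

    def streaming_variant_calls(reads, graph, cutoff=20):
        for read in reads:
            hit = graph.align(read)        # align read to assembly graph
            if hit.is_novel:
                graph.add(read)            # novelty? retain it in the graph
            elif hit.coverage >= cutoff:
                # Saturated: count & discard; emit any variant positions
                # that have just reached saturation (online output).
                for variant in hit.saturated_variants():
                    yield variant
            else:
                graph.add(read)            # below saturation: keep accumulating

Raising the cutoff trades compression for sensitivity, as quantified on the next slides.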
Coverage is adjusted to retain signal
Reference-free variant calling
• Streaming & online algorithm; single pass.
  • For real-time diagnostics, it can be applied as bases are emitted from the sequencer.
• Reference-free: independent of reference bias.
• Coverage of variants is adaptively adjusted to retain all signal.
• Parameters are easily tuned, although the theory remains to be developed:
  • high sensitivity (e.g. coverage cutoff C=50 against 100x coverage) => poor compression;
  • low sensitivity (C=20) => good compression.
• Can “subtract” the reference => novel structural variants.
  • (See: Cortex, Zam Iqbal.)
Concluding thoughts
• This approach could provide substantial practical and theoretical leverage on a challenging problem.
• These techniques provide a path to the future:
  • many-core implementation; distributable?
  • decreased memory footprint => cloud/rental computing can be used for many analyses.
• Still early days, but funded…
• Our other techniques are already in use: ~dozens of labs are using digital normalization.
References & reading list
• Iqbal et al., De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet., 2012. (PubMed 22231483)
• Nordstrom et al., Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat. Biotechnol., 2013. (PubMed 23475072)
• Brown et al., A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802.
Note: this talk is online at slideshare.net, c.titus.brown.
Acknowledgements
Lab members involved:
• Adina Howe (w/ Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Chris Welcher
Collaborators:
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding: USDA NIFA; NSF IOS; BEACON.
Thank you for the invitation!
Editor's Notes
• #3: Bad habit…
• #25: Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.