C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
May 1, 2013
ctb@msu.edu
Streaming approaches to reference-free variant
calling
Open, online science
Much of the software and approaches I’m talking
about today are available:
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
Outline & Overview
• Motivation: lots of data, analyzed with “offline” approaches.
• Reference-based vs. reference-free approaches.
• Single-pass algorithms for lossy compression; application to resequencing data.
Shotgun sequencing
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
Sequencers produce errors
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
Three basic problems
Resequencing, counting, and assembly; in this talk, resequencing & counting are grouped together.
Resequencing analysis
We know a reference genome and want to find variants (blue) in a background of errors (red).
Counting
We have a reference genome (or gene set) and want to know how much of each we have. Think gene expression/microarrays, copy number variation.
Noisy observations <->
information
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
“Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than Moore’s
Law.
2. Your data gathering rate matches Moore’s Law.
3. Your data gathering rate exceeds Moore’s Law.
[Figure: sequencing costs over time; source: http://www.genome.gov/sequencingcosts/]
“Three types of data scientists.”
1. Your data gathering rate is slower than Moore’s
Law.
=> Be lazy, all will work out.
2. Your data gathering rate matches Moore’s Law.
=> You need to write good software, but all will
work out.
3. Your data gathering rate exceeds Moore’s Law.
=> You need serious help.
Random sampling => deep sampling
needed
Typically 10-100x coverage is needed for robust recovery (~300 Gbp for a human genome at 100x).
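As a quick back-of-the-envelope check of that figure, assuming a ~3 Gbp human genome:

    # Back-of-the-envelope: deep random sampling of a human genome.
    genome_size_bp = 3e9        # approximate human genome size (assumption)
    fold_coverage = 100         # upper end of the 10-100x range above
    print("%.0f Gbp" % (fold_coverage * genome_size_bp / 1e9))  # -> 300 Gbp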
Applications in cancer genomics
• Single-cell cancer genomics will advance:
  • e.g. ~60-300 Gbp of data for each of ~1000 tumor cells;
  • infer the phylogeny of the tumor => mechanistic insight.
• Current approaches are computationally intensive and data-heavy.
Current variant calling approach.
Map reads to reference → “pileup” and variant calling → downstream diagnostics.
Drawbacks of reference-based approaches
• Fairly narrowly defined heuristics.
• Allelic mapping bias: mapping is biased towards the reference allele.
• Ignorant of “unexpected” novelty:
  • indels, especially large indels, are often ignored;
  • structural variation is not easily retained or recovered;
  • true novelty is discarded.
• Most implementations are multi-pass over big data.
Challenges
• Considerable amounts of noise in the data (0.1-1% error).
• Reference-based approaches have several drawbacks:
  • dependent on the quality/applicability of the reference;
  • detection of true novelty (SNPs vs. indels; SVs) is problematic.
• => The first major data reduction step (variant calling) is extremely lossy in terms of potential information.
[Diagram: raw data (~10-100 GB) → analysis → multiple “information” products (~1 GB) → database & integration; compression (~2 GB).]
A software & algorithms approach: can we develop
lossy compression approaches that
1. Reduce data size & remove errors => efficient
processing?
2. Retain all “information”? (think JPEG)
If so, then we can store only the compressed data for
later reanalysis.
Short answer is: yes, we can.
[Same diagram, annotated: “Save in cold storage”; “Save for reanalysis, investigation.”]
My lab at MSU:
Theoretical => applied solutions.
Theoretical advances in data structures and algorithms → practically useful & usable implementations, at scale → demonstrated effectiveness on real data.
1. Time- and space-efficient k-mer counting
To add an element: increment the associated counter at all hash locales.
To get a count: retrieve the minimum counter across all hash locales.
Source: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
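This add/count pair is the count-min sketch pattern. A minimal Python sketch of the idea follows; khmer's actual counting structure is implemented in C++ and differs in detail (hash functions, table sizing):

    # Minimal count-min sketch for k-mer counting (illustrative only).
    import hashlib

    class CountMinSketch:
        def __init__(self, num_tables=4, table_size=1000003):
            self.table_size = table_size
            self.tables = [[0] * table_size for _ in range(num_tables)]

        def _locales(self, kmer):
            # One hash locale per table, derived from a salted digest.
            for i, table in enumerate(self.tables):
                digest = hashlib.sha1(("%d:%s" % (i, kmer)).encode()).hexdigest()
                yield table, int(digest, 16) % self.table_size

        def add(self, kmer):
            # To add an element: increment the counter at every hash locale.
            for table, loc in self._locales(kmer):
                table[loc] += 1

        def count(self, kmer):
            # To get a count: take the minimum across all hash locales.
            # Collisions only inflate counters, so the error is one-sided.
            return min(table[loc] for table, loc in self._locales(kmer))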
2. Compressible assembly graphs (NOVEL)
[Figure: results at 1%, 5%, 10%, and 15%; Pell et al., PNAS, 2012.]
• Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
• The core algorithm is single-pass and “low” memory.
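To sketch the idea behind these compressible graphs (following Pell et al.): store the k-mers in an approximate membership structure and recover graph edges by querying all possible neighbors. A minimal illustration, reusing the CountMinSketch above as the membership structure; the real implementation uses Bloom filters in C++:

    # Implicit de Bruijn graph: a node's neighbors are the k-mers that
    # overlap it by k-1 bases AND are present in the (approximate) k-mer set.
    def neighbors(kmer, is_present):
        for base in "ACGT":
            for candidate in (kmer[1:] + base, base + kmer[:-1]):
                if is_present(candidate):
                    yield candidate

    # Usage sketch: treat any k-mer with a nonzero estimated count as present.
    # cms = CountMinSketch(); ...
    # list(neighbors("ATGGCA", lambda k: cms.count(k) > 0))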
3. Online, streaming, lossy
compression.
(NOVEL)
Brown et al., arXiv, 2012
Digital normalization
[Series of figure-only slides stepping through the digital normalization process.]
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• is reference-free;
• is single-pass: looks at each read only once;
• does not “collect” the majority of errors;
• keeps all low-coverage reads & retains all their information;
• smooths out coverage across regions.
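A minimal sketch of the diginorm loop in Python, reusing the CountMinSketch above and assuming reads longer than the k-mer size; khmer's normalize-by-median.py is the real implementation:

    def normalize_by_median(reads, ksize=20, cutoff=20):
        counts = CountMinSketch()
        for read in reads:
            kmers = [read[i:i + ksize] for i in range(len(read) - ksize + 1)]
            estimates = sorted(counts.count(k) for k in kmers)
            median = estimates[len(estimates) // 2]
            if median < cutoff:
                # Estimated coverage still low: keep the read, count its k-mers.
                for k in kmers:
                    counts.add(k)
                yield read
            # Otherwise the region is saturated; the read (and most of the
            # errors it carries) is discarded without being counted.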
Can we apply this algorithmically efficient
technique to variants? Yes.
Single-pass, reference-free, tunable, streaming, online variant calling.
Align reads to assembly graph
Dr. Jason Pell
Reference-free variant calling.
Align read to graph → novelty? Retain. → Saturated? Count & discard. → Output variant at saturation (online) → downstream diagnostics.
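A hypothetical Python sketch of this loop; graph.align and the fields on the returned hit are invented names for illustration, not khmer's actual API:

    def streaming_variant_calls(reads, graph, cutoff=20):
        for read in reads:
            hit = graph.align(read)        # align read to assembly graph
            if hit.is_novel:
                graph.add(read)            # novelty? retain it in the graph
            elif hit.coverage >= cutoff:
                # Saturated: count & discard; emit any variant positions
                # that have just reached saturation (online output).
                for variant in hit.saturated_variants():
                    yield variant
            else:
                graph.add(read)            # below saturation: keep accumulating

Raising the cutoff trades compression for sensitivity, as quantified on the next slides.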
Coverage is adjusted to retain signal
Reference-free variant calling
• Streaming & online algorithm; single pass.
  • For real-time diagnostics, it can be applied as bases are emitted from the sequencer.
• Reference-free: independent of reference bias.
• Coverage of variants is adaptively adjusted to retain all signal.
• Parameters are easily tuned, although the theory remains to be developed:
  • high sensitivity (e.g. coverage cutoff C=50 against 100x coverage) => poor compression;
  • low sensitivity (C=20) => good compression.
• Can “subtract” the reference => novel structural variants.
  • (See: Cortex, Zam Iqbal.)
Concluding thoughts
• This approach could provide substantial practical and theoretical leverage on a challenging problem.
• These techniques provide a path to the future:
  • many-core implementation; distributable?
  • decreased memory footprint => cloud/rental computing can be used for many analyses.
• Still early days, but funded…
• Our other techniques are already in use: ~dozens of labs are using digital normalization.
References & reading list
• Iqbal et al., De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet., 2012. (PubMed 22231483)
• Nordstrom et al., Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat. Biotechnol., 2013. (PubMed 23475072)
• Brown et al., A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802.
Note: this talk is online at slideshare.net, c.titus.brown.
Acknowledgements
Lab members involved:
• Adina Howe (w/ Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Chris Welcher
Collaborators:
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding: USDA NIFA; NSF IOS; BEACON.
Thank you for the invitation!
Editor's Notes
• #3: Bad habit…
• #25: Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.