PacMin: rethinking genome 
analysis with long reads 
Frank Austin Nothaft, AMPLab 
Joint work with Adam Bloniarz 
10/14/2014
Note: 
• This talk is mostly speculative. 
• I.e., the methods we’ll talk about are 
partially* implemented. 
• This means you have an opportunity to steer the 
direction of this work! 
* I’m being generous to myself.
Sequencing 101 
• Most sequence data today comes from Illumina 
machines, which perform sequencing-by-synthesis 
• We get short (100-250 bp) reads, with high accuracy 
• Reads are (usually) paired 
http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png
Current Pipelines are 
Reference Based 
• Map subsequences to a “reference genome” 
• Compute variants (diffs) against the reference 
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
An aside: What is the 
reference genome? 
• Pool together n individuals, and assemble their genomes 
together 
• A few problems: 
• How does the reference genome handle polymorphisms? 
• What about structural rearrangements? 
• Subpopulation-specific alternate haplotypes? 
• It has gaps. 14 years after the first human reference 
genome was released, it is still incomplete.* 
* This problem is Hard.
The Sequencing Abstraction 
It was the best of times, it was the worst of times… 
[Figure: reads sampled from the sentence above: “It was the” / “the best of” / “times, it was” / “worst of times” / “the worst of” / “best of times was the worst”] 
• Sample Poisson-distributed substrings from a larger string 
• Reads are more or less unique and correct 
Metaphor borrowed from Michael Schatz
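The sampling abstraction above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: reads have a fixed length, start positions are drawn uniformly at random (so per-position depth is only approximately Poisson), and the function names `sample_reads` and `coverage` are made up for this sketch.

```python
import random


def sample_reads(text, read_len, n_reads, seed=0):
    """Draw n_reads substrings of length read_len at uniformly random starts."""
    rng = random.Random(seed)
    last_start = len(text) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(start, text[start:start + read_len]) for start in starts]


def coverage(text_len, reads, read_len):
    """Count how many reads cover each position of the original string."""
    depth = [0] * text_len
    for start, _ in reads:
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


text = "It was the best of times, it was the worst of times"
reads = sample_reads(text, read_len=12, n_reads=20)
depth = coverage(len(text), reads, read_len=12)

# Mean depth is (n_reads * read_len) / len(text); individual positions vary,
# and some may get zero coverage, which is where "gaps" come from.
mean_depth = sum(depth) / len(depth)
uncovered = [pos for pos, d in enumerate(depth) if d == 0]
```

Even in this idealized model some positions can end up uncovered; the following slides are about how real data is worse than this.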
…is a leaky abstraction 
• We frequently encounter “gaps” in the sequence 
Ross et al, Genome Biology 2013
…is a leakier abstraction 
• We preferentially sequence from “biased” regions: 
Ross et al, Genome Biology 2013
A very leaky abstraction! 
• Reads aren’t actually correct 
• >2% error (expect 0.1% variation) 
• Error probability estimates are cruddy 
• Reads aren’t actually unique 
• >7% of the genome is not unique (K. Curtis, SiRen)
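To make the error-versus-variation point concrete, here is a back-of-the-envelope calculation using the slide's lower bounds. The 250 bp read length is taken from the upper end of the short-read range quoted earlier; the variable names are for illustration only.

```python
read_len = 250            # bp, upper end of the short-read range
error_rate = 0.02         # ">2% error" from the slide, used as a lower bound
variation_rate = 0.001    # "expect 0.1% variation"

# Expected number of sequencing errors and true variants in one read.
expected_errors = read_len * error_rate
expected_variants = read_len * variation_rate

# How many erroneous bases we expect to see per true variant.
errors_per_variant = expected_errors / expected_variants
```

Under these assumptions a single read carries roughly twenty erroneous bases for every true variant, which is why the quality of the error probability estimates matters so much.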
The State of Analysis 
• We’re really good at calling SNPs! 
• But we’re still pretty bad at calling INDELs and SVs 
• And we’re also bad at expressing diffs 
• Hence, SMaSH! But really, the reference + diff format needs to be burnt to the ground and redesigned. 
• And it’s slow: 2 weeks to sequence, 1 week to analyze. Not fast enough for practical clinical use.
Opportunities 
• New read technologies are available 
• Provide much longer reads (>10 kbp vs. 250 bp) 
• Different error model… (15% INDEL errors, vs. 2% SNP errors) 
• Generally, lower sequence-specific bias 
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available… 
• We can use conventional methods: 
Carneiro et al, Genome Biology 2012
But! 
• Why not make raw assemblies out of the reads? 
Step 1: Find overlapping reads. For all pairs of reads (i, j): does i overlap j? 
Step 2: Find consensus sequence: …ACACTGCGACTCATCGACTC… 
• Problems: 
1. Overlapping is O(n^2) and a single evaluation is expensive anyways 
2. Typical algorithms find a single consensus sequence; what if we’ve got 
polymorphisms?
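A minimal sketch of the all-pairs overlap step, to show where the O(n^2) comes from. It assumes error-free reads and exact suffix/prefix matching, which is far cheaper than the alignment a real overlapper for noisy long reads would need; the function names and the three example reads (cut from the sequence on the slide) are illustrative.

```python
from itertools import combinations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_pairs_overlaps(reads, min_len=3):
    """Check every unordered pair in both directions; n reads -> n(n-1)/2 pairs."""
    overlaps = {}
    comparisons = 0
    for i, j in combinations(range(len(reads)), 2):
        comparisons += 1
        forward = suffix_prefix_overlap(reads[i], reads[j], min_len)
        backward = suffix_prefix_overlap(reads[j], reads[i], min_len)
        if forward:
            overlaps[(i, j)] = forward
        if backward:
            overlaps[(j, i)] = backward
    return overlaps, comparisons


# Three reads cut from ACACTGCGACTCATCGACTC.
reads = ["ACACTGCGAC", "TGCGACTCAT", "ACTCATCGACTC"]
overlaps, comparisons = all_pairs_overlaps(reads)
```

With n reads the loop visits n(n-1)/2 pairs, and each visit is itself a scan over the reads; with realistic error rates each visit becomes an alignment, which is the "single evaluation is expensive" part.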
Fast Overlapping with 
MinHashing 
• Wonderful realization by Berlin et al. [1]: overlapping is similar to the document similarity problem 
• Use MinHashing to approximate similarity: 
[1] Berlin et al, bioRxiv 2014 
Per document/read, compute signature: 
1. Cut into shingles 
2. Apply random hashes to shingles 
3. For each random hash, take the min over all shingles 
Hash into buckets: 
Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l) 
Compare: 
For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l 
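A minimal sketch of the signature and compare steps, in plain Python rather than Spark. Assumptions made for this sketch and not taken from the talk: shingles are fixed-length substrings (k-mers), the "random hashes" are linear functions (a*x + b) mod a prime applied to a stable base hash, and all function names are made up.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus for the random linear hash functions


def shingles(read, k):
    """Cut a read into its set of overlapping length-k substrings."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def base_hash(shingle):
    """Stable 64-bit hash of a shingle (Python's built-in hash() is salted)."""
    digest = hashlib.blake2b(shingle.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hashes(num_hashes, seed=0):
    """Draw (a, b) parameters for num_hashes random linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def signature(read, k, hashes):
    """For each random hash, keep the minimum value over all shingles."""
    values = [base_hash(s) for s in shingles(read, k)]
    return [min((a * x + b) % PRIME for x in values) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """(# equal hashes) / l, for two signatures of the same length l."""
    equal = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return equal / len(sig_a)


hashes = make_hashes(64)
sig_a = signature("ACACTGCGACTCATCGACTC", 4, hashes)
sig_b = signature("CTGCGACTCATCGACTCGGA", 4, hashes)
similarity = estimate_jaccard(sig_a, sig_b)
```

The estimate gets tighter as the signature length l grows, at the cost of more hashing per read.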
• Easy to implement in Spark: map, groupBy, map, filter
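The map, groupBy, map, filter shape can be mimicked in plain Python to show the data flow. This is not Spark code and not PacMin's implementation; it is a sketch that reads "b buckets" as the usual locality-sensitive hashing bands (each signature split into b bands of l/b values), with hypothetical function names and hand-made toy signatures.

```python
from collections import defaultdict
from itertools import combinations


def band_keys(read_id, sig, bands):
    """map: emit one (bucket key, read id) record per band of the signature."""
    rows = len(sig) // bands
    return [((band, tuple(sig[band * rows:(band + 1) * rows])), read_id)
            for band in range(bands)]


def estimate_jaccard(sig_a, sig_b):
    """(# equal hashes) / l for two signatures of the same length."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def lsh_threshold(bands, sig_len):
    """Approximate similarity above which pairs tend to share a bucket."""
    return (1.0 / bands) ** (bands / sig_len)


def candidate_overlaps(signatures, bands, min_similarity):
    # map: signature -> keyed band records
    keyed = [record for read_id, sig in signatures.items()
             for record in band_keys(read_id, sig, bands)]
    # groupBy: collect the read ids that landed in the same bucket
    buckets = defaultdict(set)
    for key, read_id in keyed:
        buckets[key].add(read_id)
    # map: every pair inside a bucket gets a similarity estimate
    scored = {}
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            scored[(a, b)] = estimate_jaccard(signatures[a], signatures[b])
    # filter: keep only pairs that look similar enough
    return {pair: s for pair, s in scored.items() if s >= min_similarity}


# Toy signatures, written by hand for illustration.
signatures = {"r1": [1, 2, 3, 4], "r2": [1, 2, 9, 9], "r3": [7, 7, 7, 7]}
pairs = candidate_overlaps(signatures, bands=2, min_similarity=0.5)
threshold = lsh_threshold(bands=2, sig_len=4)
```

Only reads that share at least one bucket are ever compared, which is what avoids the all-pairs scan.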
Overlaps to Assemblies 
• Finding pairwise overlaps gives us a directed 
graph between reads (lots of edges!)
Transitive Reduction 
• We can find a consensus between clique members 
• Or, we can reduce down: 
• Via two iterations of Pregel!
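For reference, a sequential sketch of transitive reduction on a small overlap graph. It assumes the graph is acyclic and fits in memory, and it is a plain-Python stand-in for illustration, not the Pregel formulation mentioned on the slide; the function names and the toy graph are made up.

```python
def reachable(graph, start):
    """All nodes reachable from start by following one or more edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def transitive_reduction(graph):
    """Drop edge u -> w whenever w is also reachable via another successor of u."""
    reduced = {u: set(succs) for u, succs in graph.items()}
    for u, succs in graph.items():
        for v in succs:
            # Anything reachable from v is implied by u -> v, so the
            # direct edge from u is redundant.
            for w in reachable(graph, v):
                reduced[u].discard(w)
    return reduced


# Reads a, b, c, d laid out left to right, each overlapping all later reads.
overlap_graph = {
    "a": {"b", "c", "d"},
    "b": {"c", "d"},
    "c": {"d"},
    "d": set(),
}
reduced = transitive_reduction(overlap_graph)
```

On this toy input the reduction leaves a simple chain a, b, c, d, which is the kind of structure a consensus step can walk.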
Actually Making Calls 
• From here, we need to call copy number per edge 
• Probably via Newton-Raphson based on coverage; we’re not sure yet. 
• Then, per position in each edge, call alleles: 
Notes: 
Equation is from Li, Bioinformatics 2011 
g = genotype state 
m = ploidy 
ϵ = probability allele was erroneously observed 
k = number of reads observed 
l = number of reads observed matching “reference” allele 
TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
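The equation itself is not reproduced in this transcript. As a hedged sketch, the biallelic genotype likelihood in Li (2011), as I recall it, is L(g) = (1/m^k) · [(m−g)ϵ + gϵ′]^l · [(m−g)ϵ′ + gϵ]^(k−l) with ϵ′ = 1 − ϵ, where g is read as the number of reference alleles in the genotype. The code below assumes a single error probability ϵ shared by all reads (the paper uses a per-read value); check it against the paper before relying on it. Parameter names are illustrative.

```python
def genotype_likelihood(g, ploidy, n_reads, n_ref, eps):
    """Likelihood of observing n_ref reference reads out of n_reads.

    g       -- genotype state, assumed here to be the number of reference alleles
    ploidy  -- m in the slide's notation
    n_reads -- k, number of reads observed at the site
    n_ref   -- l, number of reads matching the "reference" allele
    eps     -- probability that an allele was erroneously observed
    """
    # Probability weight of a read showing the reference allele.
    ref_term = (ploidy - g) * eps + g * (1.0 - eps)
    # Probability weight of a read showing the alternate allele.
    alt_term = (ploidy - g) * (1.0 - eps) + g * eps
    return (ref_term ** n_ref) * (alt_term ** (n_reads - n_ref)) / (ploidy ** n_reads)


# Diploid site, 10 reads, 5 of which match the reference allele.
likelihoods = {g: genotype_likelihood(g, ploidy=2, n_reads=10, n_ref=5, eps=0.01)
               for g in range(3)}
best_genotype = max(likelihoods, key=likelihoods.get)
```

With an even split of reads the heterozygous state (g = 1) gets the highest likelihood, as expected. The TBD above still applies: this form assumes two alleles and a known reference allele.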
Output 
• Current assemblers emit FASTA contigs 
• In layperson’s terms: long strings 
• We’ll emit “multigs”, which we’ll map back to a reference graph 
• Multig = multi-allelic (polymorphic) contig 
• Working with UCSC, who’ve done some really neat work [1] deriving formalisms & building software for mapping between sequence graphs, and with the GA4GH reference variation team 
[1] Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.

PacMin @ AMPLab All-Hands
