(More!) Tools and Algorithms for Genomic Analysis on Spark
Ryan Williams
6/6/2017
Previously, at Spark Summit East…
- Guacamole: somatic variant caller on Spark
- magic-rdds: collections algorithms on RDDs
- slides, video
Slides: http://bit.ly/ss17-ryan
This episode
- coverage-depth analysis tool
- cluster bake-off: in-house hadoop vs. gcloud
- hadoop-bam: parable of a legacy genomics file format in a distributed world
- bonus: suffix-arrays
Hammer Lab
- Mt. Sinai School of Medicine, Parker Institute for Cancer Immunotherapy
- 12 people, mostly computational + ____
- personal genome vaccine trial(s) underway
- misc clinical data analysis
- long-running background thread porting biofx tools to Spark
Spark-based Genomic Analysis tools/platforms
- Broad Institute
  - GATK4 - next generation of GATK suite of tools
  - Hail - variant analysis at scale
- AMP Lab: bigdatagenomics
  - ADAM - QC / variant-calling / viz tools
  - bdg-formats - avro schemas for genomic record-types
- Hammer Lab: pageant
  - coverage-depth: QC analyses
  - guacamole: somatic variant caller
coverage-depth - joint histogram of the coverage-depth distributions of two samples
coverage-depth: progress and WIP
- running on google cloud and local hadoop cluster
- WIP: multi-plot.ly web-based report
- real-world use:
  - “Contribution of systemic and somatic factors to clinical response and resistance to PD-L1 blockade in urothelial cancer: An exploratory multi-omic analysis”, Snyder et al. 2017
  - upcoming lung-cancer study
    - normalizing mutation counts by # exonic loci with depth ≥ cutoff
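The normalization mentioned in the last bullet can be sketched as follows. This is a minimal illustration, not the coverage-depth tool's code: the function names, the in-memory list of per-locus depths, and the per-megabase scaling are all assumptions made for the example.

```python
from typing import Iterable


def callable_loci(depths: Iterable[int], cutoff: int) -> int:
    """Count exonic loci whose coverage depth is at least `cutoff`."""
    return sum(1 for d in depths if d >= cutoff)


def normalized_mutation_rate(num_mutations: int, depths: Iterable[int], cutoff: int) -> float:
    """Mutations per megabase of sufficiently covered exonic loci.

    The per-megabase scaling is a hypothetical choice for this sketch.
    """
    n = callable_loci(depths, cutoff)
    if n == 0:
        raise ValueError("no loci meet the depth cutoff")
    return num_mutations / n * 1_000_000


# Toy data: six exonic loci, four of which reach depth 10.
toy_depths = [3, 12, 25, 10, 0, 40]
print(callable_loci(toy_depths, cutoff=10))
print(normalized_mutation_rate(2, toy_depths, cutoff=10))
```

Dividing by the number of adequately covered loci keeps a poorly covered sample from looking artificially mutation-poor.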
In-house Hadoop cluster vs. Google Cloud Dataproc
- Demeter: 100-node, 2400-core cluster
  - $500k circa 2013…
  - ≈ half now?
  - + X% sysadmin allocation
- Google Cloud Dataproc:
  - pre-emptible nodes: $0.02/cpu/hr
  - non-pre-emptible nodes: $0.06/cpu/hr
  - 1 Demeter’s worth of cores for 4 years: $1.7MM (at the pre-emptible rate)
- utilization break-even range: 10-25%
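The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the break-even definition used here (in-house purchase price divided by the cloud cost at 100% utilization) is an assumption; the slide's 10-25% range presumably also reflects inputs it does not spell out, such as current hardware prices and the sysadmin share.

```python
HOURS_PER_YEAR = 24 * 365


def cloud_cost(cores: int, usd_per_core_hour: float, years: float) -> float:
    """Cost of renting `cores` around the clock for `years`."""
    return cores * usd_per_core_hour * HOURS_PER_YEAR * years


def break_even_utilization(in_house_usd: float, cores: int,
                           usd_per_core_hour: float, years: float) -> float:
    """Fraction of the time the cores must be busy for in-house to cost the same."""
    return in_house_usd / cloud_cost(cores, usd_per_core_hour, years)


# Slide figures: 2400 cores, 4 years, $0.02 and $0.06 per core-hour, $500k cluster.
print(cloud_cost(2400, 0.02, 4))                       # roughly $1.68MM
print(break_even_utilization(500_000, 2400, 0.02, 4))  # pre-emptible pricing
print(break_even_utilization(500_000, 2400, 0.06, 4))  # non-pre-emptible pricing
```

With the hardware price alone, the break-even utilization comes out near 30% against pre-emptible pricing and near 10% against non-pre-emptible pricing.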
Recent analysis: coverage-depth of TCGA lung cancer BAMs
- 1060 BAMs (LUAD + LUSC): 14TB
- filter to ensembl exons + by minimum depth
- goal: normalize each sample’s mutation-count by its number of exonic loci with sufficient depth
- 1 ephemeral cluster per app?
- or: 1 big cluster w/ many apps simultaneously
⇒ 10 dataproc clusters of 77 4-core nodes (308 cores each)
- 10mins per sample, 2 samples on a cluster at a time
- 6hrs, $400
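As a rough sanity check on the $400 figure, the sketch below prices the run at the pre-emptible rate of $0.02 per core-hour quoted for Dataproc earlier in the talk. Applying that rate to every core, and ignoring any non-pre-emptible nodes or service fees, is an assumption; `run_cost` is a hypothetical helper.

```python
def run_cost(clusters: int, cores_per_cluster: int, hours: float,
             usd_per_core_hour: float) -> float:
    """Total cost of running several identical clusters for `hours`."""
    return clusters * cores_per_cluster * hours * usd_per_core_hour


cores_per_cluster = 77 * 4  # 77 four-core nodes
estimate = run_cost(10, cores_per_cluster, 6, 0.02)
print(cores_per_cluster, round(estimate, 2))
```

The estimate lands a little under $400, in line with the slide's rounded figure.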
Recent analysis: coverage-depth of TCGA lung cancer BAMs
- Twist: 2 (of 1060) BAMs consistently failed: “MRNM should not be set for unpaired read.”
- BAMs seemed ok in samtools
… debugging ⟹ Bad splits!
Splitting BAM files
Splitting files
[Diagram: a file of consecutive records is divided into four 64MB splits (Split 1 to Split 4), one each for Machines A to D. Reality: split boundaries fall in the middle of records.]
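For newline-delimited records, the usual way to cope with boundaries that cut records in half is to let each split own exactly the records that begin inside it. The sketch below illustrates that convention on an in-memory byte string; it is an assumption-laden toy (the function name and the 13-byte split size are made up) and not hadoop-bam's logic, which cannot rely on newlines.

```python
from typing import List


def records_for_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the newline-terminated records whose first byte lies in [start, end)."""
    pos = start
    # If the split starts mid-record, skip ahead to the next record boundary.
    if start > 0 and data[start - 1:start] != b"\n":
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    out = []
    # Read whole records, running past `end` to finish the last one.
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        out.append(data[pos:stop])
        pos = stop + 1
    return out


data = b"".join(b"record%d\n" % i for i in range(10))
splits = [(s, min(s + 13, len(data))) for s in range(0, len(data), 13)]
all_records = [r for s, e in splits for r in records_for_split(data, s, e)]
print(all_records)
```

Every record is produced exactly once even though most boundaries land mid-record. The approach depends on a delimiter that marks record starts unambiguously, which BAM does not have.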
hadoop-bam
- Implementation of Hadoop File{In,Out}putFormat
- Original implementation circa 2010
- Semi-abandoned but critical library underneath Hammer Lab, BDG, and Broad efforts
- Main goal: “split” BAM files
BAM SAM format
- Sequence Alignment/Map
Header:
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
…
Reads:
HWI-ST807:8592:79724 163 1 10001 0 101M = 10009 109 TAACCCTAACC…
HWI-ST807:8592:79724 83 1 10009 0 101M = 10001 -109 ACCCTAACCCT…
HWI-ST807:9505:89866 163 1 10048 29 20M1D81M = 10368 374 CCAACCCTAAC…
HWI-ST807:6431:65669 163 1 10335 29 1S90M2D = 10458 224 CAACCCTAACC…
…
- Probably splittable (on newlines)?
BAM format
→ SAM format
+ Binary record codec:
  #bytes contig start mapq len(name) name len(cigar) flags len(seq) cigar seq quals tags
+ Block-gzip compression (BGZF):
  “Magic” 1f 8b 08 04 | Size | Data | … | 1f 8b 08 04 | Size | Data | 1f 8b 08 04 | Size | Data
  (each block: ≤ 64k uncompressed, ≈ 20k compressed)
Splitting BAMs
- BGZF:
  - scan (≤ 64k) until magic 0x1f8b0804
  - optional: skip ahead “size” bytes, verify “magic” again
  - certainty: 1 in (2^32)^(N blocks) chance of a false positive
  “Magic” 1f 8b 08 04 | Size | Data | … | 1f 8b 08 04 | Size | Data | 1f 8b 08 04 | Size | Data
- Binary records:
  #bytes contig start mapq len(name) name len(cigar) flags len(seq) cigar seq quals tags
  - contig: 0 ≤ contig < #(contigs)
  - start: start ∈ [0,len(contig))
  - name: ASCII chars, ‘\0’-terminated
  - cigar: op lengths ≋ read length
  - cigar: ops ∈ {MIDNSHP=X}
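The BGZF half of this procedure (scan for the magic bytes, then follow the size field and verify the magic again) can be sketched in a few lines. This is an illustration under stated assumptions, not hadoop-bam's code: it assumes the common header layout with a single `BC` extra subfield (XLEN = 6, block size stored at byte 16), and `make_block` exists only to build test data.

```python
import struct
import zlib
from typing import Optional

MAGIC = b"\x1f\x8b\x08\x04"


def make_block(payload: bytes) -> bytes:
    """Build one BGZF block (18-byte header, raw deflate data, CRC32 + length)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # negative wbits: raw deflate
    cdata = comp.compress(payload) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # stored value is total block size minus 1
    header = (MAGIC + struct.pack("<IBBH", 0, 0, 0xFF, 6)
              + b"BC" + struct.pack("<HH", 2, bsize))
    return header + cdata + struct.pack("<II", zlib.crc32(payload), len(payload))


def block_size_at(buf: bytes, pos: int) -> Optional[int]:
    """Total block size if a plausible BGZF header starts at `pos`, else None."""
    if pos + 18 > len(buf) or buf[pos:pos + 4] != MAGIC:
        return None
    (xlen,) = struct.unpack_from("<H", buf, pos + 10)
    if xlen != 6 or buf[pos + 12:pos + 14] != b"BC":
        return None
    (bsize,) = struct.unpack_from("<H", buf, pos + 16)
    return bsize + 1


def find_block_start(buf: bytes, start: int, n_verify: int = 2) -> Optional[int]:
    """First offset >= start from which `n_verify` consecutive headers chain up."""
    pos = buf.find(MAGIC, start)
    while pos != -1:
        cur, ok = pos, True
        for _ in range(n_verify):
            size = block_size_at(buf, cur)
            if size is None:
                ok = False
                break
            cur += size
            if cur >= len(buf):  # ran off the end of the buffer: nothing more to verify
                break
        if ok:
            return pos
        pos = buf.find(MAGIC, pos + 1)
    return None


blocks = [make_block(b"A" * 1000), make_block(b"C" * 500), make_block(b"G" * 200)]
buf = b"".join(blocks)
decoy = b"xx" + MAGIC + b"j" * 20 + buf  # magic bytes followed by junk
print(find_block_start(buf, 1), find_block_start(decoy, 0))
```

Each additional verified block multiplies the odds against a chance match, which is what makes block boundaries easy to find. Record boundaries inside the decompressed data have no magic number, which is why they need the heuristics in the second list.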
Case Study: BAM-splitting false positive
- TCGA 19155553-8199-4c4d-a35d-9a2f94dd2e7d, offset 268458108:115
00 0f 01 00 00 00 00 00 00 70 0f 7d 01 22 00 3d 18 00 00 a5 00 4c 00 00 00 00 00 00 00 70 0f 7d 01

Field               Reading from byte 0    Reading from byte 1
record size         69376 bytes            271 bytes remaining
contig idx          0                      0
locus               2098163712             24973168
mapq                34                     0
len(name)           1                      34
cigar ops           24                     0
seq len             19456                  76
next-read contig    0                      0
next locus          2098163712             24973168
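Both readings can be reproduced by unpacking the slide's bytes with the fixed-field layout from the BAM specification (little-endian: block_size, refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq, next_refID, next_pos), once from byte 0 and once from byte 1. The sketch below is illustrative only; the contig length used in the range check is an assumption taken from the example SAM header earlier in the talk, since this file's own header is not shown.

```python
import struct

RAW = bytes.fromhex(
    "00 0f 01 00 00 00 00 00 00 70 0f 7d 01 22 00 3d 18 00 00 a5 00 4c 00 00 "
    "00 00 00 00 00 70 0f 7d 01"
)

# Fixed-size prefix of a BAM alignment record (32 bytes, little-endian).
FIXED = struct.Struct("<iiiBBHHHiii")
FIELDS = ("block_size", "ref_id", "pos", "l_read_name", "mapq", "bin",
          "n_cigar_op", "flag", "l_seq", "next_ref_id", "next_pos")

# Assumption: contig 0 has the length shown in the example header (@SQ SN:1).
CONTIG0_LEN = 249250621


def decode(buf: bytes, offset: int) -> dict:
    """Interpret the bytes at `offset` as the start of a BAM record."""
    return dict(zip(FIELDS, FIXED.unpack_from(buf, offset)))


def pos_in_range(rec: dict) -> bool:
    """One of the sanity checks: the start locus must fall inside the contig."""
    return 0 <= rec["pos"] < CONTIG0_LEN


for offset in (0, 1):
    rec = decode(RAW, offset)
    print(offset, rec["block_size"], rec["pos"], rec["mapq"],
          rec["l_read_name"], rec["n_cigar_op"], rec["l_seq"], pos_in_range(rec))
```

Under that assumed contig length, the reading from byte 0 fails the locus range check while the reading from byte 1 passes, which is consistent with the first reading being the false positive.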
hammerlab/hadoop-bam
- “fork” of upstream hadoop-bam
- additional checks avoid known false-positives
* easy to add, seemingly unnecessary thus far
† partial credit; only 1 random check performed on subsequent reads
hammerlab/hadoop-bam
- “check” mode evaluates every position in BAM ⟶
    invalidCigarOp: 28661374692
    tooLargeNextReadIdx: 27924049452
    tooLargeReadIdx: 27924049452
    nonNullTerminatedReadName: 24885666031
    tooFewRemainingBytesImplied: 23071387740
    nonASCIIReadName: 2367016056
    noReadName: 2271887125
    negativeNextReadIdx: 1582430053
    negativeReadIdx: 1582430053
    negativeReadPos: 1582430053
    negativeNextReadPos: 1582430053
    emptyReadName: 232401822
    tooLargeNextReadPos: 43095171
    tooLargeReadPos: 43095171
    tooFewBytesForReadName: 73
    tooFewFixedBlockBytes: 35
    tooFewBytesForCigarOps: 16
- also: positions where ≤ 2 checks supported (true) “negative” call
“Full” Checker - Spark History
- 10GB BAM, 30BN uncompressed positions, 94MM reads
- 100% checker accuracy
- Largest shuffle: 600+ GB
⇒ 20 bytes / position (compressed)
Parallelize split computation
[Diagram: the Driver and Machines A to D over a file of records whose split boundaries fall mid-record]
Before:
- 4 mins (200 splits)
- slow gcloud-storage seek round-trips?
After:
- 4mins → 8s
- ≈ 32 threads
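The speedup on this slide (about 32 threads, 4 minutes down to 8 seconds) points to split starts being computed concurrently, so that the slow storage round-trips overlap. A minimal sketch of that idea with a thread pool follows; the talk's tools run on the JVM, so this Python version only illustrates the pattern, and `toy_find_start` is a hypothetical stand-in for the real seek-and-scan logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def compute_split_starts(find_start: Callable[[int], int],
                         offsets: Sequence[int],
                         max_workers: int = 32) -> List[int]:
    """Resolve each nominal split offset to a record start, using a thread pool.

    Results come back in the same order as `offsets`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(find_start, offsets))


def toy_find_start(offset: int, record_size: int = 100) -> int:
    """Stand-in for a remote seek + scan: round up to the next record boundary."""
    return -(-offset // record_size) * record_size


offsets = [i * 64 for i in range(200)]  # 200 nominal split offsets
starts = compute_split_starts(toy_find_start, offsets)
print(starts[:5])
```

Threads suit this workload when the work is dominated by waiting on storage latency rather than by computation.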
Parallelize split computation, pt 2
[Diagram: the Driver and Machines A to D over a file of records whose split boundaries fall mid-record]
- jury still out on whether this makes sense
- probably not on current test sets (10GB / 150 64MB splits)
- possibly on larger ones! (150GB / 5k 32MB splits)
Do we have to use BAMs?
- VCFs being deprecated (at least culturally)
- BAMs seem like they’re sticking around
- Long reads may incentivize dropping BAM
- Aligners output BAMs
⟹ Someone should write a distributed aligner
hammerlab/suffix-arrays
- Distributed construction of suffix arrays and FM-Indices
- WIP
- Open q’s
  - how to use them in distributed env
  - output binary-compatible indices that other tools would generate?
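For readers unfamiliar with the data structures named here, the sketch below builds a suffix array and the Burrows-Wheeler transform (the core of an FM-index) for a tiny string. It is a naive single-machine construction for illustration only, not the distributed algorithm the project implements; the function names are hypothetical.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes of `text`, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text: str, sa: List[int]) -> str:
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    return "".join(text[i - 1] for i in sa)


text = "banana$"  # '$' is a sentinel that sorts before every other character
sa = suffix_array(text)
print(sa)
print(bwt(text, sa))
```

The naive sort compares whole suffixes and does not scale to genome-sized references.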
Ongoing/Future Work
- release / publish Pageant suite of tools
- top of stack: guacamole (somatic variant caller)
- long reads?
Questions?
