Why is Bioinformatics 
(well, really, “genomics”) 
a Good Fit for Spark? 
Timothy Danford 
AMPLab
A One-Slide Introduction to Genomics
Bioinformatics computation is batch 
processing and workflows 
● Bioinformatics has a lot of 
“workflow engines” 
○ Galaxy, Taverna, Firehose, Zamboni, 
Queue, Luigi, bPipe 
○ bash scripts 
○ even make, fer cryin’ out loud 
○ a new one every day 
● Bioinformatics software 
development is still largely a 
research activity
State-of-the-Art infrastructure: 
shared filesystems, handwritten parallelism 
● Hand-written task creation 
● File formats instead of APIs or 
data models 
○ formats are poorly defined 
○ contain optional or 
redundant fields 
○ semantics are unclear 
● Workflow engines can’t take 
advantage of common 
parallelism between stages
Why is Bioinformatics a Good Fit for Spark?
So, why Spark?
Most of Genomics is 1-D Geometry
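One way to read the "1-D geometry" claim: reads, genes, and variants can all be modeled as intervals on a line, and much analysis reduces to asking whether two intervals overlap. The sketch below is illustrative only; the `Region` type, its field names, and the half-open coordinate convention are assumptions, not something the slides specify.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A hypothetical genomic interval: half-open [start, end) on one contig."""
    contig: str  # chromosome or contig name, e.g. "chr1"
    start: int   # 0-based, inclusive
    end: int     # exclusive


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions are on the same contig and share at least one base."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


read = Region("chr1", 100, 200)
gene = Region("chr1", 150, 250)
print(overlaps(read, gene))
```

With half-open coordinates, two regions that merely touch (one ends where the other starts) do not overlap.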
The rest is iterative evaluation of 
probabilistic models!
Spark RDDs and Partitioners allow 
declarative parallelization for genomics 
● Genomics computation 
is parallelized in a small, 
standard number of 
ways 
○ by position 
○ by sample 
● Declarative, flexible 
partitioning schemes 
are useful
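As a rough illustration of the two partitioning schemes named above, here is a minimal plain-Python sketch (no Spark dependency) of functions that map a record to a partition index, which is the job a Spark Partitioner performs. The function names, the fixed-size binning, and the CRC32 hash are all assumptions made for the example.

```python
import zlib


def make_position_partitioner(contig_lengths, bin_size):
    """Partition by position: each contig is cut into fixed-size bins.

    contig_lengths: dict of contig name -> length in bases (insertion-ordered).
    Returns (partition_fn, number_of_partitions).
    """
    offsets = {}
    total = 0
    for contig, length in contig_lengths.items():
        offsets[contig] = total
        total += -(-length // bin_size)  # ceiling division: bins on this contig

    def partition(contig, pos):
        return offsets[contig] + pos // bin_size

    return partition, total


def make_sample_partitioner(num_partitions):
    """Partition by sample: a stable hash of the sample ID, modulo the partition count."""
    def partition(sample_id):
        return zlib.crc32(sample_id.encode("utf-8")) % num_partitions

    return partition


by_position, n = make_position_partitioner({"chr1": 1000, "chr2": 500}, bin_size=250)
print(n, by_position("chr1", 999), by_position("chr2", 0))
```

Swapping one function for the other changes how the data is laid out without touching the analysis code, which is the sense in which the partitioning is declarative.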
Spark can easily express genomics primitives: 
join by genomic overlap 
1. Calculate disjoint 
regions based on left 
(blue) set 
2. Partition both sets by 
disjoint regions 
3. Merge-join within each 
partition 
4. (Optional) aggregation 
across joined pairs
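The four steps above can be sketched in plain Python on small in-memory lists. This is not ADAM's implementation: intervals are assumed to be half-open `(contig, start, end)` tuples, all function names are made up for the example, and step 3 is simplified to a scan over sorted input with early exit rather than a full merge-join.

```python
from collections import defaultdict


def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into sorted, disjoint regions."""
    regions = []
    for contig, start, end in sorted(left):
        if regions and regions[-1][0] == contig and start < regions[-1][2]:
            prev = regions[-1]
            regions[-1] = (contig, prev[1], max(prev[2], end))
        else:
            regions.append((contig, start, end))
    return regions


def assign(intervals, regions):
    """Step 2: send each interval to every region it overlaps (linear scan for clarity)."""
    parts = defaultdict(list)
    for iv in intervals:
        for idx, (contig, rstart, rend) in enumerate(regions):
            if iv[0] == contig and iv[1] < rend and rstart < iv[2]:
                parts[idx].append(iv)
    return parts


def overlap_join(left, right):
    """Steps 1-3: return every (left, right) pair that overlaps."""
    regions = disjoint_regions(left)
    left_parts = assign(left, regions)
    right_parts = assign(right, regions)
    pairs = []
    for idx in sorted(left_parts):
        rights = sorted(right_parts.get(idx, []))  # sorted by start within one contig
        for l in sorted(left_parts[idx]):
            for r in rights:
                if r[1] >= l[2]:
                    break  # rights are sorted by start, so no later one can overlap l
                if l[1] < r[2]:
                    pairs.append((l, r))
    return pairs


def count_overlaps(pairs):
    """Step 4 (optional): aggregate, here by counting right intervals per left interval."""
    counts = defaultdict(int)
    for l, _ in pairs:
        counts[l] += 1
    return dict(counts)


left = [("chr1", 0, 10), ("chr1", 5, 15), ("chr1", 100, 110)]
right = [("chr1", 8, 12), ("chr1", 105, 120), ("chr2", 0, 10), ("chr1", 50, 60)]
print(count_overlaps(overlap_join(left, right)))
```

Each left interval lies inside exactly one disjoint region, so a pair is never emitted twice even when a right interval spans several regions. In a distributed setting, each region index would correspond to a partition that can be joined independently.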
ADAM is Genomics + Spark 
● A rewrite of core bioinformatics tools and algorithms in Spark 
● Combines three 
technologies 
○ Spark 
○ Parquet 
○ Avro 
● Apache 2-licensed 
● Started at the AMPLab 
http://bdgenomics.org/
Avro and Parquet are just as critical to 
ADAM as Spark 
● Avro to define data models 
● Parquet for serialization format 
● Still need to answer design 
questions 
○ how wide are the schemas? 
○ how much do we follow existing 
formats? 
○ how do we carry through projections?
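To make the schema and projection questions concrete, here is a small sketch of an Avro-style record schema written as a Python dict, plus a helper that projects it down to a subset of fields. The record name, namespace, and fields are hypothetical and are not ADAM's actual schema; only the general shape of an Avro record declaration is taken from the Avro format.

```python
import json

# Hypothetical record schema in Avro's JSON form (not ADAM's real schema).
READ_SCHEMA = {
    "type": "record",
    "name": "ExampleRead",
    "namespace": "example.genomics",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "sequence", "type": ["null", "string"], "default": None},
        {"name": "qualities", "type": ["null", "string"], "default": None},
    ],
}


def project(schema, keep):
    """Return a copy of the schema containing only the named fields, in schema order."""
    fields = [f for f in schema["fields"] if f["name"] in keep]
    missing = set(keep) - {f["name"] for f in fields}
    if missing:
        raise KeyError(f"fields not in schema: {sorted(missing)}")
    return {**schema, "fields": fields}


positions_only = project(READ_SCHEMA, {"contig", "start", "end"})
print(json.dumps(positions_only, indent=2))
```

A wide schema with many optional fields makes projections like this more valuable, because a columnar format such as Parquet can skip the columns that were projected away.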
Still need to convince bioinformaticians to 
rewrite their software! 
● A single piece of a 
single filtering stage 
for a somatic variant 
caller 
● “11-base-pair window 
centered on a candidate 
mutation” actually 
turns out to be 
optimized for a 
particular file format 
and sort order 
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
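For contrast with a definition tied to a file format and sort order, a format-independent reading of "11-base-pair window centered on a candidate mutation" might look like the sketch below. It does not reproduce the caller described in Cibulskis et al.; the function names and the half-open coordinates are assumptions for illustration.

```python
def centered_window(pos, width=11):
    """Half-open [start, end) window of odd width centered on pos."""
    if width % 2 == 0:
        raise ValueError("width must be odd to center on a single position")
    half = width // 2
    return pos - half, pos + half + 1


def count_in_window(positions, pos, width=11):
    """Count how many positions fall inside the window, in any input order."""
    start, end = centered_window(pos, width)
    return sum(1 for p in positions if start <= p < end)


print(centered_window(100))
print(count_in_window([106, 95, 100, 94, 105], 100))
```

Defined this way, the result depends only on coordinates, so it is the same however the input happens to be sorted or stored.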
The Future: 
Distributed and Incremental? 
● Today: 5k samples x 20 Gb / sample 
● Tomorrow: 1m+ samples @ 200+ Gb / sample? 
● More and more analysis is aggregative 
○ joint variant calling, 
○ panels of normal samples, 
○ collective variant annotation 
● And “data collection” will never be finished
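The scale implied by the "today" and "tomorrow" figures can be worked out directly. The sketch below keeps the slide's own unit ("Gb") without interpreting it, treats the "1m+" and "200+" figures as lower bounds, and uses decimal prefixes; the function name is made up for the example.

```python
def total_gb(samples, gb_per_sample):
    """Total data volume, in the slide's 'Gb' unit."""
    return samples * gb_per_sample


today = total_gb(5_000, 20)            # 5k samples x 20 Gb / sample
tomorrow = total_gb(1_000_000, 200)    # 1m samples x 200 Gb / sample (lower bound)

print(f"today:    {today:,} Gb  (= {today // 1_000:,} Tb)")
print(f"tomorrow: {tomorrow:,} Gb  (= {tomorrow // 1_000_000:,} Pb)")
print(f"growth:   {tomorrow // today:,}x")
```

That is roughly 100 Tb today against at least 200 Pb tomorrow, a factor of about 2,000.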
Acknowledgements 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Carl Yeksigian (DataStax) 
Anthony Philippakis (Broad Institute) 
Jeff Hammerbacher (Cloudera / Mt. Sinai) 
Thank you! 
(questions?)
