Design for Scalability
in ADAM
Frank Austin Nothaft
UC Berkeley
What is ADAM?
• An open source, high performance, distributed
platform for genomic analysis
• ADAM defines:
1. A data schema and layout on disk*
2. A Scala API
3. A command line interface
* Via Avro and Parquet
What’s the big picture?
• ADAM: Core API + CLIs
• bdg-formats: Data schemas
• RNAdam: RNA analysis on ADAM
• avocado: Distributed local assembler
• Guacamole: Distributed somatic caller
• xASSEMBLEx: GraphX-based de novo assembler
• bdg-services: ADAM clusters
Implementation Overview
• 27k LOC (99% Scala)
• Apache 2 licensed OSS
• 21 contributors across 8 institutions
• Pushing for a production 1.0 release toward the end of the year
Key Observations
• Current genomics pipelines are I/O limited
• Most genomics algorithms can be formulated as a
data or graph parallel computation
• These algorithms are heavy on iteration/pipelining
• Data access pattern is write once, read many times
• High-coverage whole-genome sequencing will become the main
target (for human genetics)
Principles for Scalable
Design in ADAM
• Parallel FS and data representation (HDFS +
Parquet) combined with in-memory computing
eliminates the disk bandwidth bottleneck
• Spark allows efficient implementation of iterative/
pipelined Map-Reduce
• Minimize data movement: send code to data
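The iterative, pipelined Map-Reduce style these principles call for can be sketched with plain Scala collections (no cluster here; the real pipeline uses Spark RDDs). A toy per-position coverage count over hypothetical Read records:

```scala
// Hypothetical toy records: (contig, start) for aligned reads.
case class Read(contig: String, start: Long)

val reads = Seq(Read("chr1", 10), Read("chr1", 10), Read("chr2", 5))

// "Send code to data": express the computation as map/reduce stages
// that a framework like Spark ships to the nodes holding each partition.
val coverage: Map[(String, Long), Int] =
  reads
    .map(r => ((r.contig, r.start), 1))            // map: key each read by position
    .groupBy(_._1)                                 // shuffle: group by key
    .map { case (k, vs) => (k, vs.map(_._2).sum) } // reduce: sum counts per position
```

The same three-stage shape translates directly to Spark's RDD API, where each stage runs in parallel across partitions.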
Data Format
• Avro schema encoded by
Parquet
• Schema can be updated
without breaking backwards
compatibility
• Read schema looks a lot like
BAM, but renormalized
• Actively removing tags
• Variant schema is strictly
biallelic, a “cell in the matrix”
record AlignmentRecord {	
union { null, Contig } contig = null;	
union { null, long } start = null;	
union { null, long } end = null;	
union { null, int } mapq = null;	
union { null, string } readName = null;	
union { null, string } sequence = null;	
union { null, string } mateReference = null;	
union { null, long } mateAlignmentStart = null;	
union { null, string } cigar = null;	
union { null, string } qual = null;	
union { null, string } recordGroupName = null;	
union { int, null } basesTrimmedFromStart = 0;	
union { int, null } basesTrimmedFromEnd = 0;	
union { boolean, null } readPaired = false;	
union { boolean, null } properPair = false;	
union { boolean, null } readMapped = false;	
union { boolean, null } mateMapped = false;	
union { boolean, null } firstOfPair = false;	
union { boolean, null } secondOfPair = false;	
union { boolean, null } failedVendorQualityChecks = false;	
union { boolean, null } duplicateRead = false;	
union { boolean, null } readNegativeStrand = false;	
union { boolean, null } mateNegativeStrand = false;	
union { boolean, null } primaryAlignment = false;	
union { boolean, null } secondaryAlignment = false;	
union { boolean, null } supplementaryAlignment = false;	
union { null, string } mismatchingPositions = null;	
union { null, string } origQual = null;	
union { null, string } attributes = null;	
union { null, string } recordGroupSequencingCenter = null;	
union { null, string } recordGroupDescription = null;	
union { null, long } recordGroupRunDateEpoch = null;	
union { null, string } recordGroupFlowOrder = null;	
union { null, string } recordGroupKeySequence = null;	
union { null, string } recordGroupLibrary = null;	
union { null, int } recordGroupPredictedMedianInsertSize = null;	
union { null, string } recordGroupPlatform = null;	
union { null, string } recordGroupPlatformUnit = null;	
union { null, string } recordGroupSample = null;	
union { null, Contig} mateContig = null;	
}
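In the Scala API, nullable Avro unions like those above map naturally onto Scala's Option type. A hypothetical, trimmed-down mirror of a few AlignmentRecord fields (not the actual generated class):

```scala
// Hypothetical simplified mirror of a few AlignmentRecord fields;
// the real class is generated from the Avro schema above.
case class SimpleAlignment(
  contigName: Option[String] = None, // union { null, Contig } contig
  start: Option[Long]        = None, // union { null, long } start
  mapq: Option[Int]          = None, // union { null, int } mapq
  readMapped: Boolean        = false // union { boolean, null } readMapped
)

val unmapped = SimpleAlignment(readMapped = false)
val mapped   = SimpleAlignment(Some("chr1"), Some(100L), Some(60), readMapped = true)

// Option forces explicit handling of missing values at compile time.
def startOrZero(a: SimpleAlignment): Long = a.start.getOrElse(0L)
```

Because every nullable field defaults to None (or false), new fields can be added to the schema without breaking readers of older files.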
Parquet
• ASF Incubator project, based on
Google Dremel
• http://www.parquet.io
• High performance columnar
store with support for projections
and push-down predicates
• 3 layers of parallelism:
• File/row group
• Column chunk
• Page
Image from Parquet format definition: https://github.com/Parquet/parquet-format
Filtering
• Parquet provides predicate pushdown
• Evaluate filter on a subset of columns
• Only read full set of projected columns for passing records
• Full primary/secondary indexing support in Parquet 2.0
• Very efficient if reading a small set of columns:
• On disk, contig ID/start/end consume < 2% of space
Image from Parquet format definition: https://github.com/Parquet/parquet-format
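The filtering pattern can be sketched column-wise: evaluate the predicate against only the small position columns, and materialize the wide columns just for the passing rows. A toy sketch with Scala vectors standing in for Parquet column chunks (hypothetical data, not the Parquet API):

```scala
// Columnar toy layout: one vector per column, index i = record i.
val contig   = Vector("chr1", "chr1", "chr2", "chr2")
val start    = Vector(100L, 5000L, 200L, 9000L)
val sequence = Vector("ACGT...", "TTGA...", "CCAT...", "GGTA...") // the "wide" column

// The predicate touches only contig/start -- < 2% of the bytes on
// disk -- so the wide column data is never read for failing rows.
def overlaps(queryContig: String, lo: Long, hi: Long): Seq[Int] =
  contig.indices.filter(i =>
    contig(i) == queryContig && start(i) >= lo && start(i) <= hi)

// Only now fetch the projected wide column for passing records.
val hits = overlaps("chr1", 0L, 1000L).map(i => sequence(i))
```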
Compression
• Parquet compresses
at the column level:
• RLE for repetitive
columns
• Dictionary
encoding for
quantized
columns
• ADAM uses a fully
denormalized schema
• Repetitive columns are
RLE’d out
• Delta encoding
(Parquet 2.0) will aid
with quality scores
• ADAM is 5-25% smaller
than compressed BAM
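Run-length encoding, which Parquet applies to repetitive columns, is easy to sketch. In a fully denormalized schema, a column like recordGroupSample repeats one value across millions of reads and collapses to a handful of (value, run length) pairs:

```scala
// Minimal run-length encoder for a column of values.
def rle[A](col: Seq[A]): Seq[(A, Int)] =
  col.foldLeft(List.empty[(A, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend the current run
    case (acc, x)                      => (x, 1) :: acc      // start a new run
  }.reverse

def unrle[A](runs: Seq[(A, Int)]): Seq[A] =
  runs.flatMap { case (v, n) => Seq.fill(n)(v) }

// A denormalized sample-name column: 6 cells collapse to 2 runs.
val column  = Seq("NA12878", "NA12878", "NA12878", "NA12891", "NA12891", "NA12891")
val encoded = rle(column)
```

The sample names here are illustrative; the point is that denormalization costs little on disk because the repeats are RLE'd away.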
Parquet/Spark Integration
• 1 row group in Parquet maps
to 1 partition in Spark
• We interact with Parquet via
input/output formats
• These apply projections
and predicates, handle
(de)compression
• Spark builds and executes a
computation DAG, manages
data locality, errors/retries, etc.
[Diagram: Parquet row groups (RG 1 … RG n) flow through the Parquet input format into Spark partitions (Partition 1 … Partition n), and back out through the Parquet output format.]
Compatibility
• Maintain full import/export compatibility with SAM/
BAM, VCF/BCF
• Can use non-ADAM tools in pipeline:*
* Via avocado: https://www.github.com/bigdatagenomics/avocado
[Diagram: row groups (RG 1 … RG n) are repartitioned into per-chromosome partitions (Part. 1 … Part. n), each chromosome (Chr. 1 … Chr. M) is piped through the external tool, and the output is repartitioned back into row groups.]
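The repartition-and-pipe flow can be sketched with collections: gather reads from row-group partitions into per-chromosome partitions, then run a stand-in for the external tool on each. Records and data here are hypothetical; the real pipeline does this with Spark partitions:

```scala
case class Read(contig: String, name: String)

// Reads as they land in Parquet row-group partitions (arbitrary order).
val partitions: Seq[Seq[Read]] = Seq(
  Seq(Read("chr1", "r1"), Read("chr2", "r2")),
  Seq(Read("chr1", "r3"), Read("chrM", "r4"))
)

// Repartition: one partition per chromosome, so a legacy tool that
// expects a whole contig at once can be piped each partition.
val byContig: Map[String, Seq[Read]] =
  partitions.flatten.groupBy(_.contig)

// Stand-in for "Chr. N into Pipe": here, just count reads per contig.
val perContigResult: Map[String, Int] =
  byContig.map { case (c, rs) => (c, rs.size) }
```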
“Cloud” Optimizations
• Emerging use case (?): processing data on public
cloud provider machines, data stored in block store
• E.g., Amazon EMR + S3
• We are optimizing Parquet for S3/other block
stores:
• Compact primary indices for slice lookup
• Eliminate Parquet requirement on HDFS
Spark
• An in-memory data parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data is maintained in memory unless inter-node
movement is needed (e.g., on repartitioning)
• Presents a functional programming API, along with
support for interactive programming via a REPL
• Used at scale on clusters with >2k nodes and 4 TB
datasets
Why Spark?
• Current leading map-reduce framework:
• First in-memory map-reduce platform
• Used at scale in industry, supported in major distros (Cloudera,
Hortonworks, MapR)
• The API:
• Fully functional API
• Main API in Scala; Java, Python, and R are also supported
• Manages node/job failures via lineage, data locality/job assignment
• Downstream tools (GraphX, MLLib)
Cluster Setups
• Spark is optimized for Hadoop, but is being run on
traditional HPC clusters (e.g., LBNL, Janelia Farm)
• Tachyon file system cache can be used as a
high performance layer between Spark and
HPC file systems
• At Berkeley, we normally run on cloud vendors
• Performance is ~4x better on bare metal
Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher,
Jey Kottalam, Christos Kozanitis
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael
Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Jeremy Elson, Ravi Pandya
• And many other open source contributors: 21
contributors to ADAM/BDG from >8 institutions
Acknowledgements
This research is supported in part by NSF CISE
Expeditions Award CCF-1139158, LBNL Award
7076018, DARPA XData Award FA8750-12-2-0331,
and gifts from Amazon Web Services, Google,
SAP, The Thomas and Stacey Siebel Foundation,
Apple, Inc., C3Energy, Cisco, Cloudera, EMC,
Ericsson, Facebook, GameOnTalis, Guavus, HP,
Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk,
Virdata, VMware, WANdisco and Yahoo!.
