ADAM: Fast, Scalable Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley, @fnothaft

with: Matt Massie, André Schumacher, Timothy Danford, Carl Yeksigian,
Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman,
Jeff Hammerbacher, Anthony Joseph, and Dave Patterson

https://github.com/bigdatagenomics
http://www.bdgenomics.org
What is in ADAM/BDG?
• ADAM: Core API + CLIs
• bdg-formats: Data schemas
• RNAdam: RNA analysis on ADAM
• avocado: Distributed local assembler
• Guacamole: Distributed somatic caller
• xASSEMBLEx: GraphX-based de novo assembler
• bdg-services: ADAM clusters
Design Goals
• Develop processing pipeline that enables
efficient, scalable use of cluster/cloud	

• Provide data format that has efficient
parallel/distributed access across platforms	

• Enhance semantics of data and allow more
flexible data access patterns
Implementation Overview
• 27K lines of Scala code	

• 100% Apache-licensed open-source	

• 21 contributors from 8 institutions	

• Working towards a production-quality release in late 2014
ADAM Stack
Physical
‣ Commodity Hardware
‣ Cloud Systems - Amazon, GCE, Azure
File/Block
‣ Hadoop Distributed Filesystem
‣ Local Filesystem
Record/Split
‣ Schema-driven records w/ Apache Avro
‣ Store and retrieve records using Parquet
‣ Read BAM Files using Hadoop-BAM
In-Memory RDD
‣ Transform records using Apache Spark
‣ Query with SQL using Shark
‣ Graph processing with GraphX
‣ Machine learning using MLBase
Design Principles
• Abstract as much as possible: schema-oriented design makes the format easy to evolve
• Provide rich and scalable APIs for manipulating and transforming genomic data and regions
• Don’t lock data in: play nicely with other tools
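The schema-evolution point can be illustrated with a toy sketch in plain Python. It is not Avro and not ADAM's schema: the field names (`readName`, `sequence`, `mapq`), the two schema versions, and the `resolve` helper are all hypothetical. It only mimics the general idea that a reader fills in defaults for fields the writer did not know about and ignores fields it does not ask for.

```python
# Toy stand-in for schema resolution; NOT the Avro library or ADAM's schemas.
# Each hypothetical schema maps a field name to its default value.
SCHEMA_V1 = {"readName": None, "sequence": None}
SCHEMA_V2 = {"readName": None, "sequence": None, "mapq": 0}  # v2 adds one field


def resolve(record, reader_schema):
    """Project a stored record onto the reader's schema, filling in defaults."""
    return {
        field: record.get(field, default)
        for field, default in reader_schema.items()
    }


old_record = {"readName": "r1", "sequence": "ACGT"}              # written with v1
new_record = {"readName": "r2", "sequence": "TTGA", "mapq": 60}  # written with v2

upgraded = resolve(old_record, SCHEMA_V2)    # old data, new reader: default used
downgraded = resolve(new_record, SCHEMA_V1)  # new data, old reader: field dropped
print(upgraded)
print(downgraded)
```

Because both directions resolve without rewriting stored data, a tool built against one schema version can keep reading files written by another.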
Parquet
• OSS created by Twitter and Cloudera, based on Google Dremel, just entered Apache Incubator
• Columnar File Format:
  • Limits I/O to only data that is needed
  • Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data
  • Fast scans - load only columns you need, e.g. scan a read flag on a whole genome, high-coverage file in less than a minute
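To make the "load only columns you need" point concrete, here is a minimal sketch in plain Python rather than Parquet itself. The record fields (`name`, `sequence`, `flags`) and both helper functions are hypothetical; the only borrowed fact is that bit 0x400 of the SAM flag marks a duplicate read.

```python
# Toy illustration of a column-oriented layout; this is not Parquet.
# Field names and helpers are hypothetical.

DUPLICATE_FLAG = 0x400  # SAM flag bit marking a PCR or optical duplicate

reads = [
    {"name": "r1", "sequence": "ACGT", "flags": 0x0},
    {"name": "r2", "sequence": "ACGA", "flags": 0x400},
    {"name": "r3", "sequence": "TTGA", "flags": 0x410},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def count_duplicates(flag_column):
    """Scan only the flags column; names and sequences are never read."""
    return sum(1 for flags in flag_column if flags & DUPLICATE_FLAG)


columns = to_columns(reads)
duplicates = count_duplicates(columns["flags"])
print(duplicates)
```

In a row-oriented file the same count would have to step over every name and sequence; with one array per field, the scan touches only the small integer column.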
Scaling Genomics: BQSR
• Broadcast 3 GB table of
variants, used for masking	

• Break reads down to
bases and map bases to
covariates	

• Calculate empirical values
per covariate	

• Broadcast observation,
apply across reads
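The four steps above can be sketched on a single machine in plain Python. This is an illustrative toy under stated assumptions, not ADAM's or GATK's implementation: the reads are assumed to align without indels, the only covariate is the reported quality score (real recalibrators use more, such as read group, cycle, and sequence context), the add-one smoothing in `empirical_quality` is an assumed choice, and every name and data value is hypothetical. The `known_variants` set plays the role of the broadcast variant table, and `table` the role of the broadcast observations.

```python
import math
from collections import defaultdict

# Hypothetical toy inputs: each read is (start position, bases, reported quals).
reference = "ACGTACGTAC"
known_variants = {3}  # masked sites, standing in for the broadcast variant table
reads = [
    (0, "ACGTA", [30, 30, 30, 30, 30]),
    (2, "GAACT", [30, 30, 20, 20, 20]),
    (5, "CGTAC", [20, 20, 30, 30, 30]),
]


def covariate(qual, cycle):
    """Assumed covariate: reported quality only (cycle is ignored here)."""
    return (qual,)


def observe(reads, reference, known_variants):
    """Break reads into bases and tally [observations, mismatches] per covariate."""
    table = defaultdict(lambda: [0, 0])
    for start, bases, quals in reads:
        for cycle, (base, qual) in enumerate(zip(bases, quals)):
            pos = start + cycle
            if pos in known_variants:
                continue  # known variants are not counted as sequencing errors
            counts = table[covariate(qual, cycle)]
            counts[0] += 1
            counts[1] += int(base != reference[pos])
    return table


def empirical_quality(observations, mismatches):
    """Phred-scale the smoothed mismatch rate (add-one smoothing is an assumption)."""
    error_rate = (mismatches + 1) / (observations + 2)
    return round(-10 * math.log10(error_rate))


def recalibrate(reads, table):
    """Replace each reported quality with the empirical value for its covariate."""
    recalibrated = []
    for start, bases, quals in reads:
        new_quals = []
        for cycle, qual in enumerate(quals):
            key = covariate(qual, cycle)
            new_quals.append(empirical_quality(*table[key]) if key in table else qual)
        recalibrated.append((start, bases, new_quals))
    return recalibrated


table = observe(reads, reference, known_variants)
recalibrated = recalibrate(reads, table)
```

In a distributed setting the tally step is a natural map and reduce by covariate key, and the resulting table is small enough to ship back to every worker for the final pass.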
Performance/Accuracy
• Fully concordant with Picard for MarkDup, >99% concordant with GATK for BQSR
[Figure: scatter plot of ADAM vs. GATK, both axes 0 to 50]
[Figure: bar chart of runtime in hours (0 to 24) for Sort, Mark Duplicates, and BQSR; series: Picard, ADAM on 100 EC2 nodes]
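The slide does not say how concordance was measured. One plausible reading, sketched below as an assumption, is the fraction of shared items (reads for duplicate marking, bases for recalibrated qualities) on which two tools give the same answer. The `concordance` helper and all data values are made up for illustration and are not results from the talk.

```python
def concordance(calls_a, calls_b):
    """Fraction of keys present in both mappings whose values agree."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two call sets have nothing in common")
    agreeing = sum(1 for key in shared if calls_a[key] == calls_b[key])
    return agreeing / len(shared)


# Hypothetical duplicate flags per read name from two tools.
tool_a_duplicates = {"r1": False, "r2": True, "r3": False, "r4": True}
tool_b_duplicates = {"r1": False, "r2": True, "r3": False, "r4": True}
markdup_concordance = concordance(tool_a_duplicates, tool_b_duplicates)

# Hypothetical recalibrated qualities keyed by (read name, base offset).
tool_a_quals = {("r1", 0): 30, ("r1", 1): 28, ("r2", 0): 31, ("r2", 1): 25}
tool_b_quals = {("r1", 0): 30, ("r1", 1): 28, ("r2", 0): 31, ("r2", 1): 26}
bqsr_concordance = concordance(tool_a_quals, tool_b_quals)

print(markdup_concordance, bqsr_concordance)
```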
Future Work
• Pushing hard towards production release	

• Building out a complete analysis pipeline

• Plan to release Python bindings	

• Work on interoperability with the Global Alliance for Genomics and Health API (http://genomicsandhealth.org/)
Call for contributions
• As an open source project, we welcome
contributions	

• We maintain a list of open enhancements at our GitHub issue trackers

• GitHub: https://www.github.com/bdgenomics

• UC Berkeley is looking to hire two full-time engineers to support this work
Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Timothy Danford, Carl Yeksigian

• The Broad Institute: Chris Hartl	

• Cloudera: Uri Laserson	

• Microsoft Research: Jeremy Elson, Ravi Pandya	

• Michael Heuer	

• And other open source contributors!
Acknowledgements
This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, WANdisco and Yahoo!.

Adam bosc-071114
