Genome Analysis Pipelines, Big Data Style

© 2015 MapR Technologies
Allen Day, PhD // Chief Scientist @ MapR.com
2016.04.12, Big Data Everywhere

Agenda
•  Presentation Motivations
–  Data inertia, data local computing
•  Highlights of BigData solutions ecosystem
–  MapR, NoSQL, Spark
•  Biotech Analytics Use Cases
–  Transition from sensors to insights - population DBs
•  NoSQL performance
–  Cost savings
•  NoSQL cost structure
–  Legacy tools – integration
•  Spark wrappers

Data Inertia
•  Newton’s 1st Law of Motion (Law of Inertia)
•  “An object at rest stays at rest … unless acted
upon by an unbalanced force”
•  Force required to transport data increases with
data size and device latency
–  CPU < CPU caches < RAM < Disk/SSD < Network
(bigger toward the right, faster toward the left)

Data Inertia + Exponential Data Growth =>
Data Local “BigData” Computing
•  Traditional algorithm design moves data to the
executing program
–  High Perf Cluster + Storage Network (HPC+SAN)
•  Key insight – program proportionally much
smaller than data, thus easier to move.
•  Modern algorithm design moves executing
program to the data
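
The size argument above can be made concrete with a little arithmetic. This is a minimal sketch under stated assumptions: the data size, program size, and link speed are hypothetical figures chosen only to show the ratio, and latency and protocol overhead are ignored.

```python
def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Time to push size_bytes over a link, ignoring latency and overhead."""
    return size_bytes / bandwidth_bytes_per_s


# Hypothetical figures (assumptions, not taken from the slides).
DATA_BYTES = 10 * 10**12          # 10 TB of raw data
PROGRAM_BYTES = 50 * 10**6        # 50 MB executable plus libraries
LINK_BYTES_PER_S = 1_250_000_000  # 10 Gbit/s network link

data_s = transfer_seconds(DATA_BYTES, LINK_BYTES_PER_S)
program_s = transfer_seconds(PROGRAM_BYTES, LINK_BYTES_PER_S)

print(f"move data to program: {data_s:.0f} s")
print(f"move program to data: {program_s:.2f} s")
print(f"ratio: {data_s / program_s:.0f}x")
```

Under these assumed sizes, shipping the program costs a fraction of a second while shipping the data costs hours, which is the motivation for data-local computing.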

Some BigData Tools
What is Spark?
•  Spark is a parallel computing framework that
allows a job to run on 1000s of computers as
easily as 1. No code changes required.
•  Makes good use of RAM and SSD storage
What is HBase?
•  HBase is a non-relational (NoSQL), distributed
database modeled on Google’s BigTable.
•  Provides highly scalable sustained and random
access to very large data sets
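
The two access patterns named above (sustained scans and random reads) can be illustrated with a toy in-memory table kept sorted by row key. This is only a sketch: `ToyTable`, its methods, and the row-key format are hypothetical and are not HBase's API, and a real store spreads key ranges across many servers.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple


class ToyTable:
    """Hypothetical single-process table that keeps rows sorted by key."""

    def __init__(self) -> None:
        self._keys: List[str] = []        # row keys, always kept sorted
        self._rows: Dict[str, dict] = {}  # row key -> column values

    def put(self, key: str, row: dict) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key: str) -> Optional[dict]:
        """Random access: fetch one row by key."""
        return self._rows.get(key)

    def scan(self, start: str, stop: str) -> Iterator[Tuple[str, dict]]:
        """Sustained access: stream rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put("chr1:000200", {"ref": "A", "alt": "G"})
table.put("chr1:000100", {"ref": "C", "alt": "T"})
table.put("chr2:000050", {"ref": "G", "alt": "A"})

one_row = table.get("chr1:000200")
# ';' sorts immediately after ':' in ASCII, so this range covers every "chr1:" key.
chr1_keys = [key for key, _ in table.scan("chr1:", "chr1;")]
```

Keeping keys sorted is what lets one structure serve both patterns: a point lookup for a single variant, and an ordered range scan over a whole chromosome.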

MapR Converged Platform for BigData

Cost-Effective ETL (Novartis)

The Problem
•  Key step in data ingest for R&D handled
by enterprise data warehouse (EDW)
–  Video, Proteomics, NGS, Metagenomics
•  EDW at maximum capacity
–  Multiple rounds of software optimization
already done
–  Data still growing
•  Insight-limiting (= career-limiting)
bottleneck

Three Options
1.  No more insights / candidates
2.  Increase EDW size
–  Expensive
–  Known to not scale well
3.  Find a more scalable solution

Original Flow – ELTL
Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Diagram labels: Extract, Load; Transform, Load; Data Warehouse; Knowledge graph; Downstream Analysis (R&D)]

Simplified Analysis – EDW Strategy
•  Majority of EDW storage consumed by ELTL
processing
–  Caused by minority of code
(raw data transformations)
•  Increasing EDW capacity yields
sub-linear performance gains
–  Poor division of labor

With ETL Offload
Raw data:
•  Compounds
•  EHR data
•  …
[Diagram labels: Extract, Load; Transform, Load; MapR; Data Warehouse; Knowledge graph; Downstream Analysis (R&D)]

Simplified Analysis – MapR Strategy
•  Lower Cost per TB of increased ETL
capacity by replacing EDW with MapR
•  Scale-out architecture – linear spend
gives linear performance increase
•  Strategic advantage – next-gen
architecture for implementing new use
cases
–  Insights/time (and career) acceleration

Additionally…
Raw data:
•  Compounds
•  EHR data
•  …
[Diagram labels: Extract, Load; MapR; Transform, Load; Data Warehouse; Knowledge graph; Downstream Analysis (R&D)]

New Use Cases are Enabled
Raw data:
•  Public and private
•  Compounds
•  EHR data
•  …
[Diagram labels: Extract, Load; MapR; Transform, Load; Data Warehouse; Knowledge graph; Downstream Analysis (R&D); New Use Cases]

NoSQL: Scalable Population DBs

Catalog genetic variants => find QTLs
•  Current public human cohort proposals:
100K–1M individuals, >400% CAGR
•  Seed and livestock companies, same trend
•  Px/Dx biomarkers for PGx, reproductive
medicine, biometrics, etc.
•  Idea is to catalog genetic variants, find QTLs
•  Well studied problem, let’s take a look

Genome × Phenome Analysis
[Figure: sparse matrix; rows are genotypes (SNPs 𝛿1, 𝛿3, 𝛿5, …; billion+), columns are phenotypes (ϕ1, ϕ3, ϕ5, …; billion+)]
For a given population,
given SNP 𝛿, and
given phenotype ϕ:
count the number
of occurrences as the
value of the matrix
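
The counting rule above can be sketched as a sparse matrix keyed by (SNP, phenotype) pairs. This is a minimal illustration that assumes an "occurrence" means one individual who carries the SNP and shows the phenotype; the function name, the `d1`/`p1` labels, and the three-person population are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[str, str]  # (SNP id, phenotype id)


def count_matrix(individuals: Iterable[Tuple[Set[str], Set[str]]]) -> Dict[Cell, int]:
    """Count co-occurrences of each SNP with each phenotype across a population.

    Only non-zero cells are stored, so the result stays sparse.
    """
    counts: Counter = Counter()
    for snps, phenotypes in individuals:
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return dict(counts)


# Hypothetical population: (SNPs carried, phenotypes observed) per individual.
population = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, {"p3"}),
]

matrix = count_matrix(population)
```

Cells missing from the dictionary are implicit zeros, which is what makes a billion-by-billion matrix storable at all.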

Associate QTLs to variants via
Genome × Phenome Matrix Factorization
[Figure: the 𝛿 × ϕ matrix, factorized with Spark & MapR into Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector)]
•  Row Eigenvectors of X represent
–  Sets of related phenotypes (by SNP)
•  Column Eigenvectors of Y represent
–  Sets of related SNPs (by phenotype)
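
As a small illustration of the factorization idea, the sketch below extracts the dominant pair of singular vectors from a tiny dense count matrix by power iteration. The slides do not say which algorithm is used, so power iteration in plain Python is a stand-in chosen for brevity, not the Spark implementation; the matrix values are hypothetical. The left singular vector (an eigenvector of M·Mᵀ) weights the SNP rows, and the right singular vector (an eigenvector of Mᵀ·M) weights the phenotype columns.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def top_singular_pair(m: Matrix, iterations: int = 100) -> Tuple[List[float], float, List[float]]:
    """Return (u, sigma, v) for the largest singular value of a non-zero matrix."""
    n_rows, n_cols = len(m), len(m[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = M v, normalised to unit length
        u = [sum(m[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v = M^T u, normalised; its length converges to sigma
        v = [sum(m[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return u, sigma, v


# Hypothetical counts: rows are SNPs (d1, d3, d5), columns are phenotypes (p1, p3, p5).
counts = [
    [4.0, 0.0, 2.0],
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
]

archetypal_genotype, strength, archetypal_phenotype = top_singular_pair(counts)
```

In this toy matrix the dominant factor groups SNPs d1 and d3 with phenotypes p1 and p5, while d5 and p3 receive near-zero weight and would surface in the next factor.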

Moreover… This is a generalized GWAS

it’s PheWAS

NB: These calculations are a mixed I/O
workload – they require high-throughput
sustained reads and low-latency random
access
Proven MapR-DB use case: Aadhaar
biometric system, biometrics of 1B humans

Furthermore…

If we change the labels…
[Figure: the same matrix layout with rows relabeled doc1, doc3, doc5 and columns relabeled user1, user3, user5]

We have the core of Google / Facebook /
Twitter Ad Revenue Engine
[Figure: doc × user matrix with factors labeled INTERESTS and BEHAVIORS]

Spark: Porting Legacy Pipelines

Alignment
[Diagram: DNA Reads + Reference Sequences → Aligned Reads → Downstream Applications…]

Alignment
[Diagram: DNA Reads + Reference Sequences → Align() → Aligned Reads → Downstream Applications…]

Possible Align() Outcomes
Inputs to Align():
•  Unaligned DNA Reads
•  Reference Sequences
Outputs of Align():
•  Single Location Reads
•  Multiple Location Reads
•  Unlocatable Reads

Many-to-Many Relationship Between Reads and
Locations
•  Read1
•  Read2
•  Read3
•  Read4
•  NULL
•  LocationA
•  LocationB
•  LocationC
•  LocationD
•  LocationA
•  NULL
•  LocationE
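
One way to hold a many-to-many relationship like this is as a list of (read, location) pairs indexed in both directions. The pairs below are hypothetical, since the slide's connecting lines are not recoverable here; `None` plays the role of NULL.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]

# Hypothetical (read, location) pairs, for illustration only.
alignments: List[Pair] = [
    ("Read1", "LocationA"),
    ("Read2", "LocationA"),  # two reads share one location
    ("Read2", "LocationB"),  # one read maps to two locations
    ("Read3", None),         # read with no location
    (None, "LocationE"),     # location with no read
]


def index_pairs(pairs: List[Pair]) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Build read -> locations and location -> reads lookups from the pairs."""
    by_read: Dict[str, List[str]] = {}
    by_location: Dict[str, List[str]] = {}
    for read, location in pairs:
        if read is not None:
            hits = by_read.setdefault(read, [])
            if location is not None:
                hits.append(location)
        if location is not None:
            readers = by_location.setdefault(location, [])
            if read is not None:
                readers.append(read)
    return by_read, by_location


by_read, by_location = index_pairs(alignments)
```

An empty list on either side corresponds to a NULL partner: an unlocatable read, or a location that no read covers.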

Parallelizing Alignment
[Diagram: Unaligned DNA Reads → Split() → Part1, Part2, Part3 → Align() → Locations (one set per part) → Concat() → Sort() → Etc… → Aligned DNA Reads]
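
The Split / Align / Concat / Sort stages can be sketched in a few lines. This is a toy under stated assumptions: `align` is an exact-substring search standing in for a real aligner, a local thread pool stands in for cluster workers, and the reference and reads are made-up strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

Hit = Tuple[str, Optional[int]]  # (read, 0-based position, or None if unlocatable)


def split(reads: List[str], n_parts: int) -> List[List[str]]:
    """Deal the reads round-robin into n_parts partitions."""
    return [reads[i::n_parts] for i in range(n_parts)]


def align(part: List[str], reference: str) -> List[Hit]:
    """Toy aligner: first exact match of each read in the reference."""
    hits: List[Hit] = []
    for read in part:
        pos = reference.find(read)
        hits.append((read, pos if pos >= 0 else None))
    return hits


def concat(parts: List[List[Hit]]) -> List[Hit]:
    """Flatten the per-partition results into one list."""
    return [hit for part in parts for hit in part]


def sort_hits(hits: List[Hit]) -> List[Hit]:
    """Order located reads by position; unlocatable reads go last."""
    return sorted(hits, key=lambda h: (h[1] is None, h[1] or 0, h[0]))


def pipeline(reads: List[str], reference: str, n_parts: int = 3) -> List[Hit]:
    parts = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        aligned_parts = list(pool.map(lambda p: align(p, reference), parts))
    return sort_hits(concat(aligned_parts))


reference = "ACGTTGCAAGGCTA"
reads = ["GGCT", "ACGT", "TTTT", "TGCA", "CAAG"]
result = pipeline(reads, reference)
```

Because each partition is aligned independently, only the final Concat and Sort steps need to see all results together.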

Using HPC+SAN has Bottlenecks (GridEngine, Etc)
[Diagram: pipeline over Part1, Part2, Part3, annotated with: Volume Read Bottleneck; Volume Write Bottleneck; Read & Write Bottleneck]

Using Spark Eliminates Bottlenecks
[Diagram: Split() → Align() → Concat() → Sort()]

Bottom Level: Integration with Legacy Tools
[Diagram labels: Container; Local I/O; Legacy Sub-process]
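
The container / local I/O / legacy sub-process arrangement can be sketched as a wrapper that writes one partition to local scratch space, runs a command-line tool on it, and reads the result back. This is an assumption-laden sketch: the "legacy tool" here is a one-line Python stand-in that upper-cases its input, and `run_legacy_on_partition` is a hypothetical name. In Spark, a function of this shape would typically be applied once per partition (for example with `mapPartitions`).

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List

# Stand-in for a legacy command-line tool: reads argv[1], writes upper-cased
# text to argv[2]. A real pipeline would invoke the actual aligner binary.
LEGACY_TOOL = [
    sys.executable,
    "-c",
    "import sys; src, dst = sys.argv[1:3]; "
    "open(dst, 'w').write(open(src).read().upper())",
]


def run_legacy_on_partition(records: List[str]) -> List[str]:
    """Write records to local disk, run the tool as a sub-process, read results."""
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / "input.txt"
        dst = Path(workdir) / "output.txt"
        src.write_text("".join(record + "\n" for record in records))
        subprocess.run(LEGACY_TOOL + [str(src), str(dst)], check=True)
        return dst.read_text().splitlines()


partitions = [["acgt", "ttga"], ["ggcc"]]
outputs = [run_legacy_on_partition(part) for part in partitions]
```

The tool only ever sees ordinary local files, so it needs no changes; the wrapper is the only piece aware of partitions.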

•  No time today to look at code, but there is a deeper
slideshow on doing this with the Bowtie aligner:
•  http://www.slideshare.net/allenday
•  https://github.com/allenday/spark-genome-alignment-demo

Thanks! Questions?
@allenday, @mapr
aday@mapr.com
linkedin.com/in/allenday
slideshare.net/allenday
