© 2015 MapR Technologies
Allen Day, PhD // Chief Scientist @ MapR.com
2016.04.12, Big Data Everywhere
Agenda
•  Presentation Motivations
–  Data inertia, data local computing
•  Highlights of BigData solutions ecosystem
–  MapR, NoSQL, Spark
•  Biotech Analytics Use Cases
–  Transition from sensors to insights - population DBs
•  NoSQL performance
–  Cost savings
•  NoSQL cost structure
–  Legacy tools – integration
•  Spark wrappers
Data Inertia
•  Newton’s 1st Law of Motion (Law of Inertia)
•  “An object at rest stays at rest … unless acted
upon by an unbalanced force”
•  Force required to transport data increases with
data size and device latency
–  CPU < CPU caches < RAM < Disk/SSD < Network
(bigger toward the right, faster toward the left)
Data Inertia + Exponential Data Growth =>
Data Local “BigData” Computing
•  Traditional algorithm design moves data to the
executing program
–  High Perf Cluster + Storage Network (HPC+SAN)
•  Key insight – program proportionally much
smaller than data, thus easier to move.
•  Modern algorithm design moves executing
program to the data
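The trade-off above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the 1 Gbit/s link, the 10 TB data set and the 50 MB program are assumed figures, not numbers from the talk, and latency and protocol overhead are ignored.

```python
def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to push size_bytes over a link, ignoring latency and protocol overhead."""
    return size_bytes / bandwidth_bytes_per_s

# Assumed, illustrative figures (not from the talk).
LINK = 1e9 / 8      # 1 Gbit/s network link, expressed in bytes per second
DATA = 10e12        # 10 TB data set
PROGRAM = 50e6      # 50 MB executable plus dependencies

move_data = transfer_seconds(DATA, LINK)        # ship the data to the program
move_program = transfer_seconds(PROGRAM, LINK)  # ship the program to the data

print(f"move data:    {move_data / 3600:.1f} hours")
print(f"move program: {move_program:.1f} seconds")
```

Under these assumptions the data takes most of a day to move while the program takes well under a second, which is the asymmetry that data-local computing exploits.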
Some BigData Tools
What is Spark?
•  Spark is a parallel computing framework that
allows a job to run on 1000s of computers as
easily as 1. No code changes required.
•  Makes good use of RAM and SSD storage
What is HBase?
•  HBase is a non-relational (NoSQL), distributed
database modeled on Google’s BigTable.
•  Provides highly scalable sustained and random
access to very large data sets
MapR Converged Platform for BigData
Cost-Effective ETL (Novartis)
The Problem
•  Key step in data ingest for R&D handled
by enterprise data warehouse (EDW)
–  Video, Proteomics, NGS, Metagenomics
•  EDW at maximum capacity
–  Multiple rounds of software optimization
already done
–  Data still growing
•  Insight-limiting (= career-limiting)
bottleneck
Three Options
1.  No more insights / candidates
2.  Increase EDW size
–  Expensive
–  Known to not scale well
3.  Find a more scalable solution
Original Flow – ELTL
Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Diagram labels: Extract, Load; Data Warehouse; Transform, Load; Knowledge graph; Downstream Analysis (R&D)]
Simplified Analysis – EDW Strategy
•  Majority of EDW storage consumed by ELTL
processing
–  Caused by minority of code
(raw data transformations)
•  Increasing EDW capacity yields
sub-linear performance gains
–  Poor division of labor
With ETL Offload
[Diagram labels: Raw data (same sources as in the original flow); Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
Simplified Analysis – MapR Strategy
•  Lower Cost per TB of increased ETL
capacity by replacing EDW with MapR
•  Scale-out architecture – linear spend
gives linear performance increase
•  Strategic advantage – next-gen
architecture for implementing new use
cases
–  Insights/time (and career) acceleration
Additionally…
[Diagram labels: Raw data (same sources as in the original flow); Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
New Use Cases are Enabled
[Diagram labels: Raw data (same sources as in the original flow); Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D); New Use Cases]
NoSQL: Scalable Population DBs
Catalog genetic variants => find QTLs
•  Current public human cohort proposals:
100K–1M individuals, >400% CAGR
•  Seed and livestock companies: same trend
•  Px/Dx biomarkers for PGx, reproductive
medicine, biometrics, etc.
•  Idea is to catalog genetic variants, then find QTLs
•  Well-studied problem; let’s take a look
Genome × Phenome Analysis
[Figure: sparse matrix with a genotype axis (𝛿1, 𝛿3, 𝛿5, …; billion+ genotypes) and a phenotype axis (ϕ1, ϕ3, ϕ5, …; billion+ phenotypes)]
For a given population, a given SNP 𝛿, and a given phenotype ϕ:
count the number of occurrences and use it as the value of the matrix
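The counting rule above can be sketched in a few lines. This is a minimal illustration, assuming that an "occurrence" means an individual who carries the SNP and also shows the phenotype; the record layout, the function name `count_matrix` and the sample identifiers are all hypothetical. Only non-zero cells are stored, which is what keeps a billion-by-billion matrix tractable.

```python
from collections import Counter

def count_matrix(observations):
    """Build a sparse SNP x phenotype count matrix.

    observations: iterable of (individual_id, snps, phenotypes), where snps and
    phenotypes are the sets observed for that individual (hypothetical layout).
    Returns a Counter keyed by (snp, phenotype); absent keys are implicit zeros.
    """
    counts = Counter()
    for _individual, snps, phenotypes in observations:
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return counts

# Made-up population of three individuals.
population = [
    ("ind1", {"d1", "d3"}, {"p1"}),
    ("ind2", {"d1"}, {"p1", "p5"}),
    ("ind3", {"d5"}, {"p3"}),
]
matrix = count_matrix(population)
print(matrix[("d1", "p1")])  # individuals carrying d1 who also show p1
```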
Associate QTLs to variants via
Genome × Phenome Matrix Factorization
[Figure: the genome × phenome matrix (axes 𝛿1, 𝛿3, 𝛿5 and ϕ1, ϕ3, ϕ5), factorized with Spark & MapR into archetypal genotypes (column eigenvector) and archetypal phenotypes (row eigenvector)]
•  Row eigenvectors of X represent
–  Sets of related phenotypes (by SNP)
•  Column eigenvectors of Y represent
–  Sets of related SNPs (by phenotype)
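For a rectangular count matrix, the row and column "eigenvectors" above correspond to its singular vectors. The sketch below finds the dominant pair with plain power iteration on a tiny dense matrix. It is a stand-in for illustration, not the Spark job from the talk: at population scale a distributed routine would be used instead (Spark MLlib's `RowMatrix.computeSVD` is one such option). The function names, the toy counts, and the choice of rows as SNPs and columns as phenotypes are assumptions.

```python
import math

def matvec(rows, v):
    """Return X v for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]

def leading_factors(rows, iterations=100):
    """Power iteration for the dominant singular pair of a small dense matrix.

    Returns (u, sigma, v): u weights the rows, v weights the columns.
    Assumes the matrix is not all zeros.
    """
    n_cols = len(rows[0])
    cols = [list(col) for col in zip(*rows)]      # X transposed
    v = [1.0 / math.sqrt(n_cols)] * n_cols        # deterministic start vector
    for _ in range(iterations):
        w = matvec(cols, matvec(rows, v))         # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    xv = matvec(rows, v)
    sigma = math.sqrt(sum(c * c for c in xv))
    u = [c / sigma for c in xv]
    return u, sigma, v

# Toy counts: rows are SNPs, columns are phenotypes (made-up numbers).
counts = [[4, 3, 0],
          [5, 4, 0],
          [0, 0, 1]]
u, sigma, v = leading_factors(counts)
print([round(x, 3) for x in u], round(sigma, 3), [round(x, 3) for x in v])
```

In this toy matrix the first two SNPs and the first two phenotypes co-occur heavily, so the leading pair puts almost all of its weight on them and almost none on the third row and column.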
Moreover… this is a generalized GWAS: it’s PheWAS
NB: These calculations are a mixed I/O workload. They require both high-throughput
sustained reads and low-latency random access.
Proven MapR-DB use case: Aadhaar biometric system, biometrics for 1B people
[Figure: the genome × phenome matrix, axes 𝛿1, 𝛿3, 𝛿5 and ϕ1, ϕ3, ϕ5]
Furthermore…
[Figure: the same matrix with its axes relabeled doc1, doc3, doc5 and user1, user3, user5]
If we change the labels…
[Figure: doc × user matrix, labeled INTERESTS and BEHAVIORS]
We have the core of the Google / Facebook /
Twitter ad revenue engine
Spark: Porting Legacy Pipelines
Alignment
[Diagram: DNA Reads + Reference Sequences → Align() → Aligned Reads → Downstream Applications…]
Possible Align() Outcomes
[Diagram: Unaligned DNA Reads + Reference Sequences → Align() → one of:]
•  Single Location Reads
•  Multiple Location Reads
•  Unlocatable Reads
Many-to-Many Relationship Between Reads and Locations
•  Read1
•  Read2
•  Read3
•  Read4
•  NULL
•  LocationA
•  LocationB
•  LocationC
•  LocationD
•  LocationA
•  NULL
•  LocationE
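One way to hold this many-to-many relationship is as a list of (read, location) pairs, from which the three Align() outcomes fall out by counting locations per read. The sketch below is an illustration under that assumption. The function name `classify_reads` and the sample pairings are hypothetical; the slide's actual read-to-location links are not reproduced.

```python
from collections import defaultdict

def classify_reads(all_reads, hits):
    """Group reads by how many reference locations they aligned to.

    all_reads: iterable of read ids.
    hits: iterable of (read_id, location) pairs reported by an aligner.
    """
    locations = defaultdict(set)
    for read_id, location in hits:
        locations[read_id].add(location)

    outcome = {"single": [], "multiple": [], "unlocatable": []}
    for read_id in all_reads:
        n = len(locations.get(read_id, ()))
        key = "unlocatable" if n == 0 else "single" if n == 1 else "multiple"
        outcome[key].append(read_id)
    return outcome

# Hypothetical pairings: LocationA is shared by two reads, Read2 maps to two
# locations, and Read4 maps nowhere (the NULL case).
hits = [("Read1", "LocationA"), ("Read2", "LocationB"),
        ("Read2", "LocationC"), ("Read3", "LocationA")]
result = classify_reads(["Read1", "Read2", "Read3", "Read4"], hits)
print(result)
```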
Parallelizing Alignment
[Diagram: Unaligned DNA Reads → Split() → Part1, Part2, Part3 → Align() → Locations (one set per part) → Concat() → Sort() → Etc… → Aligned DNA Reads]
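The Split() / Align() / Concat() / Sort() stages can be sketched on a single machine. In this toy version a thread pool stands in for cluster workers and an exact substring search stands in for a real aligner; the reference string, the reads and all function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "ACGTTTGACCAGTACGGA"   # toy reference sequence (made up)

def split(reads, n_parts):
    """Split(): deal reads round-robin into n_parts independent partitions."""
    return [reads[i::n_parts] for i in range(n_parts)]

def align(part):
    """Align(): toy exact-match aligner; position is -1 when a read is unlocatable."""
    return [(REFERENCE.find(read), read) for read in part]

def pipeline(reads, n_parts=3):
    parts = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        aligned_parts = list(pool.map(align, parts))          # Align() on each part
    merged = [hit for part in aligned_parts for hit in part]  # Concat()
    return sorted(merged)                                     # Sort() by position

reads = ["GACC", "ACGT", "TTTT", "TACG", "CAGT"]
print(pipeline(reads))
```

Because each part is aligned independently, the Align() stage parallelizes cleanly; only Concat() and Sort() need to see all of the results.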
Using HPC+SAN Has Bottlenecks (GridEngine, etc.)
[Diagram: the pipeline over Part1, Part2, Part3, annotated with:]
•  Volume Read Bottleneck
•  Volume Write Bottleneck
•  Read & Write Bottleneck
Using Spark Eliminates Bottlenecks
[Diagram: Split() → Align() → Concat() → Sort()]
Bottom Level: Integration with Legacy Tools
[Diagram labels: Local I/O; Container; Legacy Sub-process]
•  No time today to look at code, but there is a deeper
slideshow on doing this with the Bowtie aligner:
•  http://www.slideshare.net/allenday
•  https://github.com/allenday/spark-genome-alignment-demo
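The general shape of such a wrapper is to feed one partition of records to the legacy program over standard input and collect what it writes to standard output. The sketch below shows that pattern with the Python standard library only. It is not taken from the linked demo: the function name `pipe_through_tool` is hypothetical, and a one-line Python program stands in for the legacy tool so that the example runs anywhere.

```python
import subprocess
import sys

def pipe_through_tool(records, command):
    """Feed one partition of text records to an external program and yield its output lines.

    records: iterable of strings (one record per line).
    command: argv list for the legacy tool, e.g. an aligner's command line.
    """
    payload = "".join(record + "\n" for record in records)
    completed = subprocess.run(command, input=payload, capture_output=True,
                               text=True, check=True)
    yield from completed.stdout.splitlines()

# Stand-in "legacy tool": a one-line Python program that upper-cases its stdin.
# A real pipeline would put the aligner's own command line here instead.
TOOL = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]

print(list(pipe_through_tool(["acgt", "ttga"], TOOL)))
```

In PySpark, a function of this shape could be applied to each partition through `RDD.mapPartitions` (wrapped in a lambda that supplies the command), and `RDD.pipe` provides similar behaviour built in.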
Thanks! Questions?
@allenday, @mapr
aday@mapr.com
linkedin.com/in/allenday
slideshare.net/allenday

Genome Analysis Pipelines, Big Data Style
