Epinomics
Anupama Joshi
Matei Negulescu
Genomic Data Processing and
Machine Learning Workflows
using Spark
Epinomics
A platform that drives personalized medicine by leveraging big data analytics and
proprietary epigenomic technology.
Genes: 2% of the genome; non-coding sequence: 98%
What is Epigenomics?
Genomics
DNA is the hardware of the body:
static and descriptive (i.e. nature).
Epigenomics
Software layer: dynamically turns genes
on or off (i.e. nature and nurture).
Instructions encoded within non-coding sequence
Typical Genomic data
• Typical genomic sequencing data contains the nucleotide letters A, T, C and G.
• Most research work focuses on variation from standard genome sequences.
Epigenomic Data
Fragment Data
Single fragment where DNA was accessible during the
experiment.
chr1 713701 714600 +
chr1 804976 805650 +
Peaks Data
Aggregated regions of the genome where DNA was
accessible during the experiment.
chr1 713701 714600 peak.1 899 +
chr1 804976 805650 peak.2 674 +
Peaks of Accessibility
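The two record layouts above can be read with a few lines of code. The following is a minimal sketch in plain Python, assuming whitespace-separated columns; the field names (chrom, start, end, name, score, strand) are borrowed from the common BED convention and are an assumption, since the slide does not name the columns.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    # Field names are assumed (BED-like); the slide does not label the columns.
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str


def parse_peak(line: str) -> Peak:
    """Parse one whitespace-separated peak record into typed fields."""
    chrom, start, end, name, score, strand = line.split()
    return Peak(chrom, int(start), int(end), name, int(score), strand)


# The two peak rows shown on the slide.
records = [
    "chr1 713701 714600 peak.1 899 +",
    "chr1 804976 805650 peak.2 674 +",
]
peaks = [parse_peak(r) for r in records]

for p in peaks:
    # In both sample rows the fifth column equals end - start.
    print(p.name, p.end - p.start, p.score)
```

In the two sample rows the fifth column matches the peak width (end minus start); whether that holds for the full dataset is not stated in the slides.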
Genomic Data Growth
Stephens et al., Big Data: Astronomical or Genomical? (2015)
Moore’s law vs. genomics data acquisition (2015)
Data @ Epinomics
Goal: A Map of Human Health
Assessing data quality
Finding patterns in the data
– Clusters of similar data
– Significant differences between groups
– Finding unique fingerprints
Actionable Insight
– Diagnostics, new drugs, dosage, safety
Building Genomic Data Processing and Machine Learning Workflows Using Apache Spark with Anupama Joshi and Matei Negulescu
Unsupervised Patterns of Accessibility
Process and Consolidate Peaks
Store Peaks/Sample
Clustering Samples based on Peaks
Find Differences between Sample Groups
Peaks Processing
Each sample has between 150K and 200K peaks.
A typical biological experiment has between 10 and 200 samples.
Consolidate and process overlapping peaks
Processed using Spark GraphX
A typical experiment has between 300K and 600K overlapping peaks (depending on dataset and sequencing depth).
Source: http://bedtools.readthedocs.io/
Peaks Processing
Merge overlapping peaks from two genomic range vectors using the GraphX library
Nodes are peaks and edges are overlaps
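One plausible reading of "nodes are peaks and edges are overlaps" is that each connected component of the overlap graph becomes one consolidated peak. The slides do not say which GraphX operation was used, so the sketch below is only an illustration in plain Python: it uses a small union-find structure in place of a distributed graph library, assumes half-open (chrom, start, end) intervals, and the second sample's coordinates are made up.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def merge_overlapping(peaks):
    """Merge peaks that fall in the same connected component of the overlap graph."""
    parent = list(range(len(peaks)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges: every pair of overlapping peaks (all-pairs, for illustration only).
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            if overlaps(peaks[i], peaks[j]):
                parent[find(i)] = find(j)

    # Nodes grouped by component root.
    groups = defaultdict(list)
    for i, peak in enumerate(peaks):
        groups[find(i)].append(peak)

    # Each component collapses to one peak spanning its members.
    merged = [
        (g[0][0], min(p[1] for p in g), max(p[2] for p in g))
        for g in groups.values()
    ]
    return sorted(merged)


sample_a = [("chr1", 713701, 714600), ("chr1", 804976, 805650)]
sample_b = [("chr1", 714500, 714900), ("chr1", 900000, 900500)]  # hypothetical
consolidated = merge_overlapping(sample_a + sample_b)
print(consolidated)
```

The all-pairs comparison is quadratic and would not scale to hundreds of thousands of peaks per sample; it is kept here only to make the graph construction explicit.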
Unsupervised Learning
K-means and hierarchical clustering
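As a rough illustration of clustering samples by their peaks, the sketch below runs k-means (Lloyd's algorithm) in plain Python. It assumes each sample is encoded as a 0/1 vector over the consolidated peaks; that encoding, the toy data, and the choice of initial centroids are all assumptions, and the slides do not say which implementation was used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical samples: 1 means the consolidated peak is present in the sample.
samples = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
labels, centroids = kmeans(samples, centroids=[samples[0], samples[2]])
print(labels)
```

Passing the initial centroids explicitly keeps the run deterministic; a real pipeline would more likely use a library implementation with randomized or k-means++ initialization.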
Unsupervised Learning
Clustering similar datasets with PCA
[Figure: excerpt of Fig. 1 from Corces et al. Panels show the human hematopoietic hierarchy with the 13 primary cell types profiled by Fast-ATAC and RNA-seq, normalized ATAC-seq profiles at developmentally important genes, and scatterplots of log2(fragments) showing correlation between replicates and cell populations.]
Corces et al., Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution
The epigenome of each cell type is a unique fingerprint.
A mixed sample’s signature can be deconvolved into pure cell-type signals.
Supervised Learning – Cell composition
[Figure: matrix equation]
Cell type signature (number of reads at specific sites) x cell composition = Sample signature (number of reads at specific sites)
Supervised Learning – Cell composition
Cell-type specific Regions
Count Fragments in these regions per Sample
Deconvolve to describe Cell-type composition
Panels: Reference Regions; Clinical sample with mixed cells; Composition of Cells in Sample
Supervised Learning – Cell composition
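The deconvolution step can be sketched as a non-negative least-squares fit: find cell-type weights so that the weighted sum of the cell-type signatures approximates the sample signature. The slides do not name a solver, so the plain-Python example below substitutes a generic projected gradient descent; the function name, the read counts, and the 70/30 mixture are invented for illustration.

```python
def deconvolve(signatures, sample, steps=500):
    """Estimate cell-type proportions w >= 0 minimizing ||S w - sample||^2.

    signatures: one row per site, one column per cell type (read counts).
    sample: read counts of the mixed sample at the same sites.
    """
    n_sites = len(signatures)
    n_types = len(signatures[0])
    weights = [1.0 / n_types] * n_types
    # Step size 1 / ||S||_F^2 never exceeds 1 / (largest eigenvalue of S^T S).
    step = 1.0 / sum(v * v for row in signatures for v in row)
    for _ in range(steps):
        # Residual S w - sample at every site.
        residual = [
            sum(s * w for s, w in zip(row, weights)) - y
            for row, y in zip(signatures, sample)
        ]
        # Gradient of the squared error (up to a factor of 2): S^T residual.
        gradient = [
            sum(signatures[i][k] * residual[i] for i in range(n_sites))
            for k in range(n_types)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Hypothetical read counts at three cell-type specific regions.
cell_type_signatures = [
    [100, 10],   # region mostly accessible in cell type A
    [20, 80],    # region mostly accessible in cell type B
    [50, 50],    # region accessible in both
]
mixed_sample = [73, 38, 50]  # constructed as 0.7 * A + 0.3 * B
proportions = deconvolve(cell_type_signatures, mixed_sample)
print([round(p, 3) for p in proportions])
```

Normalizing the weights to sum to one turns them into proportions; this assumes the sample and the reference signatures are on a comparable read-depth scale, which real data would need a normalization step to guarantee.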
Counting Reads within Windows
Counting Reads – Range joins
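Counting reads within windows amounts to a range join between fragments and windows, followed by a count per window. The slides do not show how the join was implemented in Spark, so the following is only a single-machine sketch in plain Python: it assumes half-open (chrom, start, end) intervals and uses sorted coordinates with binary search in place of a distributed join. The third fragment and the windows are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_overlaps(fragments, windows):
    """Count fragments overlapping each window, per chromosome."""
    starts = defaultdict(list)
    ends = defaultdict(list)
    for chrom, start, end in fragments:
        starts[chrom].append(start)
        ends[chrom].append(end)
    for chrom in starts:
        starts[chrom].sort()
        ends[chrom].sort()

    counts = {}
    for chrom, w_start, w_end in windows:
        # Fragments that begin before the window ends ...
        began = bisect_left(starts[chrom], w_end)
        # ... minus those that already ended at or before the window start.
        finished = bisect_right(ends[chrom], w_start)
        counts[(chrom, w_start, w_end)] = began - finished
    return counts


fragments = [
    ("chr1", 713701, 714600),
    ("chr1", 804976, 805650),
    ("chr1", 714000, 714300),  # hypothetical extra fragment
]
windows = [
    ("chr1", 713000, 715000),
    ("chr1", 805000, 806000),
    ("chr1", 900000, 901000),
]
counts = count_overlaps(fragments, windows)
for window, n in counts.items():
    print(window, n)
```

The subtraction works because a fragment overlaps a window exactly when it starts before the window ends and ends after the window starts, so two sorted lookups replace a comparison against every fragment.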
Building a Personalized Medicine Workflow
Epinomics is building a map of human health through epigenomics.
ML pipelines combine Spark processing with traditional computing and
algorithms.
Spark helps to process tens of TB of genomic data for personalized
medicine applications.
Conclusion
Thank You.
Anupama Joshi – anupama.joshi@gmail.com
Matei Negulescu – mnegules@uwaterloo.ca
