Building Genomic Data Processing and Machine Learning Workflows Using Apache Spark with Anupama Joshi and Matei Negulescu

Epinomics
Anupama Joshi
Matei Negulescu
Genomic Data Processing and Machine Learning Workflows using Spark

Epinomics
A platform that drives personalized medicine by leveraging big data analytics and proprietary epigenomic technology.

What is Epigenomics?
2% genes, 98% non-coding sequence
Genomics
DNA is the hardware of the body: static and descriptive (i.e. nature).
Epigenomics
Software layer: dynamically turns genes on or off (i.e. nature and nurture).
Instructions encoded within non-coding sequence

Typical Genomic data
• Typical genomic sequencing data contains the nucleotide letters A, T, C, and G.
• Most research work focuses on variation from standard genome sequences.

Epigenomic Data
Fragment Data
Single fragment where DNA was accessible during the experiment.
chr1 713701 714600 +
chr1 804976 805650 +
Peaks Data
Aggregated regions of the genome where DNA was accessible during the experiment.
chr1 713701 714600 peak.1 899 +
chr1 804976 805650 peak.2 674 +
Peaks of Accessibility
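The fragment and peak records above can be read with a few lines of code. This is a minimal sketch, not the presenters' parser: the field names (`chrom`, `start`, `end`, `name`, `length`, `strand`) are assumed labels for the whitespace-separated columns, and the only thing checked is that the fifth peak column equals `end - start`, which holds for both example rows.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    # Field names are assumptions; the slides show only the raw columns.
    chrom: str
    start: int
    end: int
    strand: str
    name: Optional[str] = None
    length: Optional[int] = None


def parse_record(line: str) -> Interval:
    """Parse a 4-column fragment row or a 6-column peak row."""
    fields = line.split()
    if len(fields) == 4:
        chrom, start, end, strand = fields
        return Interval(chrom, int(start), int(end), strand)
    if len(fields) == 6:
        chrom, start, end, name, length, strand = fields
        return Interval(chrom, int(start), int(end), strand, name, int(length))
    raise ValueError(f"unexpected number of columns: {len(fields)}")


fragment = parse_record("chr1 713701 714600 +")
peaks = [
    parse_record("chr1 713701 714600 peak.1 899 +"),
    parse_record("chr1 804976 805650 peak.2 674 +"),
]

# In both example rows the fifth column matches end - start.
length_matches = [p.end - p.start == p.length for p in peaks]
print(length_matches)
```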

Genomic Data Growth
Stephens et al., Big Data: Astronomical or Genomical? (2015)
Moore’s law vs. Genomics Data Acquisition (2015)

Goal: A Map of Human Health
Assessing data quality
Finding patterns in the data
– Clusters of similar data
– Significant differences between groups
– Finding unique fingerprints
Actionable Insight
– Diagnostics, new drugs, dosage, safety

Unsupervised Patterns of Accessibility
Process and Consolidate Peaks
Store Peaks/Sample
Clustering Samples based on Peaks
Find Differences between Sample Groups

Peaks Processing
Each sample has between 150K and 200K peaks.
A typical biological experiment has between 10 and 200 samples.
Consolidate and process overlapping peaks
Processed using Spark GraphX
A typical experiment has between 300K and 600K overlapping peaks (depending on dataset and sequencing depth).
Source: http://bedtools.readthedocs.io/

Peaks Processing
Merge overlapping peaks from two genomic-range vectors using the GraphX library
Nodes are peaks and edges are overlaps
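The graph formulation can be illustrated on a single machine. The sketch below is not the GraphX code from the talk: it uses a plain-Python union-find as a stand-in for a distributed connected-components step, treats intervals as half-open, and assumes two peaks overlap only when they share at least one position on the same chromosome. The function name and the toy coordinates are hypothetical.

```python
from collections import defaultdict


def merge_overlapping_peaks(peaks):
    """Merge peaks given as (chrom, start, end) tuples.

    Peaks are nodes, overlaps are edges, and every connected component
    collapses into one consolidated peak.
    """
    parent = list(range(len(peaks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Sweep peaks in (chrom, start) order; linking each peak to the current
    # run whenever it starts before the run's furthest end is enough to
    # recover the connected components of the overlap graph.
    order = sorted(range(len(peaks)), key=lambda i: peaks[i])
    run_index, run_chrom, run_end = None, None, None
    for i in order:
        chrom, start, end = peaks[i]
        if run_index is not None and chrom == run_chrom and start < run_end:
            union(run_index, i)
            run_end = max(run_end, end)
        else:
            run_index, run_chrom, run_end = i, chrom, end

    components = defaultdict(list)
    for i, peak in enumerate(peaks):
        components[find(i)].append(peak)

    merged = [
        (members[0][0], min(p[1] for p in members), max(p[2] for p in members))
        for members in components.values()
    ]
    return sorted(merged)


# Toy peaks from two hypothetical samples.
sample_a = [("chr1", 100, 200), ("chr1", 300, 400)]
sample_b = [("chr1", 150, 250), ("chr1", 240, 320), ("chr2", 100, 200)]
consolidated = merge_overlapping_peaks(sample_a + sample_b)
print(consolidated)
```

On the toy input, the four chr1 peaks form a single chain of overlaps and collapse into one consolidated peak, while the chr2 peak stays on its own.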

Unsupervised Learning
K-means and hierarchical clustering
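As an illustration of the clustering step, here is a small k-means sketch in plain Python. The slides do not say which implementation or parameters were used, so everything here is an assumption: rows are samples, columns are accessibility counts at consolidated peaks, and the initial centroids are picked deterministically (first sample, then the sample farthest from the centroids chosen so far) to keep the example reproducible.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical sample-by-peak count matrix: two samples are open at the
# first two peaks, the other two at the last two peaks.
samples = [
    [10, 9, 0, 1],
    [9, 10, 1, 0],
    [0, 1, 10, 9],
    [1, 0, 9, 10],
]
labels, centroids = kmeans(samples, k=2)
print(labels)
```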

Unsupervised Learning
Clustering similar datasets with PCA

[Figure from Corces et al.: (a) schematic of the human hematopoietic hierarchy showing the 13 primary cell types, with number of replicates per cell type (CLP n = 5, Mono n = 6, NK n = 6, HSC n = 7, MPP n = 6, LMPP n = 3, CMP n = 8, GMP n = 7, MEP n = 7, CD4 n = 5, CD8 n = 5, B n = 4, Ery n = 8); (c) normalized ATAC-seq profiles at developmentally important genes (GATA2, LINC01272, CEBPB, GYPA, BCL11B, BLK); (d–g) scatterplots of log2(fragments) between samples, with reported correlations r = 0.73, 0.77, and 0.97.]
Supervised Learning – Cell composition
Epigenome of each cell type is a unique fingerprint
A mixed sample’s signature can be deconvolved into pure cell-type signals
Source: Corces et al., "Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution"

Cell type signature (number of reads at specific sites) × cell-type composition = Sample signature (number of reads at specific sites)
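One way to read the relation between the cell type signatures and the sample signature is as a constrained least-squares problem: find non-negative weights for the cell types that best reproduce the sample's read counts. The sketch below solves it with projected gradient descent in plain Python. It is an illustration under stated assumptions, not the method used in the talk: the solver choice, the function names, and the toy signature matrix are all hypothetical.

```python
def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def deconvolve(signatures, sample, iterations=2000):
    """Estimate cell-type proportions.

    signatures: rows are sites, columns are cell types (read counts).
    sample:     read counts of the mixed sample at the same sites.
    Minimises ||signatures @ w - sample||^2 subject to w >= 0, then
    rescales w to sum to 1.
    """
    n_types = len(signatures[0])
    signatures_t = transpose(signatures)
    # The squared Frobenius norm bounds the largest eigenvalue of S^T S,
    # so its inverse is a safe (if conservative) step size.
    step = 1.0 / sum(v * v for row in signatures for v in row)
    weights = [0.0] * n_types
    for _ in range(iterations):
        residual = [p - b for p, b in zip(matvec(signatures, weights), sample)]
        gradient = matvec(signatures_t, residual)
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Hypothetical reference signatures for three cell types at four sites.
signatures = [
    [120, 10, 5],
    [8, 95, 12],
    [4, 15, 110],
    [60, 55, 9],
]
true_composition = [0.5, 0.3, 0.2]
mixed_sample = matvec(signatures, true_composition)  # noise-free mixture

estimated = deconvolve(signatures, mixed_sample)
print([round(w, 4) for w in estimated])
```

With real data the mixture is noisy and the recovered proportions are only approximate; a dedicated non-negative least-squares solver would normally replace the hand-written loop.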

Cell-type specific Regions
Count Fragments in these regions per Sample
Deconvolve to describe Cell-type composition
Reference Regions Clinical sample with mixed cells Composition of Cells in Sample

Counting Reads – Range joins
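Counting reads per region amounts to a range join between fragments and reference regions. The sketch below shows the idea on one machine with sorted coordinate lists and binary search; it is not the Spark implementation from the talk. It assumes half-open intervals, and the function name and toy fragments are hypothetical (the two regions reuse the peak coordinates shown earlier in the deck).

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_fragments_per_region(regions, fragments):
    """Count fragments overlapping each region.

    Both inputs are lists of (chrom, start, end) tuples. A fragment
    overlaps a region when it starts before the region ends and ends
    after the region starts.
    """
    starts = defaultdict(list)
    ends = defaultdict(list)
    for chrom, start, end in fragments:
        starts[chrom].append(start)
        ends[chrom].append(end)
    for chrom in starts:
        starts[chrom].sort()
        ends[chrom].sort()

    counts = {}
    for chrom, region_start, region_end in regions:
        # Fragments that start before the region ends...
        started_before_end = bisect_left(starts[chrom], region_end)
        # ...minus those that already ended at or before the region start.
        ended_before_start = bisect_right(ends[chrom], region_start)
        counts[(chrom, region_start, region_end)] = (
            started_before_end - ended_before_start
        )
    return counts


regions = [("chr1", 713701, 714600), ("chr1", 804976, 805650)]
fragments = [
    ("chr1", 713650, 713800),  # overlaps the first region
    ("chr1", 714000, 714100),  # inside the first region
    ("chr1", 714600, 714700),  # starts exactly at the first region's end
    ("chr1", 805000, 805100),  # inside the second region
    ("chr2", 713701, 714600),  # same coordinates, different chromosome
]
counts = count_fragments_per_region(regions, fragments)
print(counts)
```

The subtraction works because every fragment that ends at or before the region start also starts before the region end, so it is always included in the first count.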

Building a Personalized Medicine Workflow

Conclusion
Epinomics is building a map of human health through epigenomics.
ML pipelines combine Spark processing with traditional computing and algorithms.
Spark helps to process tens of TB of genomic data for personalized medicine applications.

Thank You.
Anupama Joshi – anupama.joshi@gmail.com
Matei Negulescu – mnegules@uwaterloo.ca
