Data analysis & integration challenges in genomics

Data analysis and integration
challenges in genomics
Uppsala
March 19, 2015
Mikael Huss, SciLifeLab / Stockholm
University

Where I work
INTEGRATIVE AND TECHNOLOGY-DRIVEN RESEARCH IN HIGH-THROUGHPUT BIOLOGY

SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
• Inaugurated mid-2010
• Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.
• Approximately 700 researchers
• More than 100 researchers in bioinformatics and systems biology

http://ngi-status.scilifelab.se/
National genomics facilities at SciLifeLab

Clinical Genomics Clinical biomarkers Clinical sequencing
Functional genomics
Eukaryotic Single Cell Genomics
Single Cell Proteomics
Microbial Single Cell Genomics
Karolinska High Throughput Center
(KHTC)
Bioimaging - Advanced Light Microscopy, Fluorescence Correlation Spectroscopy
Drug discovery – ADME, Antibody Therapeutics, Protein Expression &
Characterization, Lead Identification, Biophysical Screening etc.
Chemical Biology Consortium Sweden – Umeå, Uppsala, KI
Structural Biology – Protein Science Facility
National facilities at SciLifeLab
Clinical diagnostics
Affinity proteomics
Biobank profiling, Cell profiling,
Fluorescence Tissue Profiling,
Mass Cytometry, PLA Proteomics,
Protein and Peptide Arrays,
Tissue Profiling

Bioinformatics facilities
• Bioinformatics compute and storage (UPPNEX)
• Short-term support (2 weeks / 80h) + paid extension
– About 45 FTEs
• Long-term support (500h) for projects selected by
external committee
“Embedded bioinformaticians”: participate in projects on a longer-term basis

Long-term bioinformatics support group
• Currently 13 senior bioinformaticians + 2 managers
• Currently recruiting for 10 new employees and
thereby expanding from Uppsala and Stockholm to
other locations in Sweden
• Example projects (from my own work):
– Characterizing the human muscle transcriptome in connection with exercise
– Metagenomics for looking at the connection between international travel and
antibiotic resistance
– Characterizing neural stem cells in developing mouse brain
– Small RNAs involved in the CRISPR/Cas9 system in bacteria

Integrative bioinformatics initiative
(“big data” project)
• Advertising for 4 positions, 2 in Gothenburg & 2 in
Stockholm
• More in-depth support, experimental planning,
method development
• Data integration

Pilot project
Connecting layers of information
• DNA
– Assays: whole-genome sequencing, exome sequencing, CGH
– Features: mutations, SNVs, copy number variations, structural variations, gene fusions
• RNA
– Assays: RNA-seq, microarrays
– Features: mRNA isoforms, allele-specific expression, fusion transcripts, eQTLs
• Proteins
– Assays: high-throughput mass spectrometry
– Features: protein isoforms, post-translational modifications

My blog: Follow the Data
Machine learning, “big data”, “data science”, often in connection with life science
Published brief notes on APIs from One Codex, Google Genomics, SolveBio

Let’s get the “big data” buzzword out of the way …

… but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further”
(OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data”
(Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html

Genomics big data in context: Throughput
[Figure: data processed per day (terabytes), log scale from 1e+00 to 1e+06. Labels: SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Twitter, Facebook, Baidu, NSA, Google, Ebay, Internet, World]

Genomics big data in context: Storage
[Figure: data stored (petabytes), log scale from 1 to 10000. Labels: AZ, SciLifeLab, Spotify, Sanger, Novartis, Ebay, Facebook, Baidu, NSA, Google]

Aside: Storage & processing frameworks
Hadoop, the standard solution for “big data” in industry, has not really caught on in genomics. Why? Some ideas:
- Existing computing infrastructure is sufficient
- Or, the field has focused on supercomputing solutions rather than commodity servers
- The programming/sysadmin skills and training are not there
- Many problems are not parallelizable
- Not enough flexibility for ad hoc, exploratory analysis
Spark/ADAM: a newer framework enabling more interactive, in-memory-oriented analysis

Genomics big data in context: Heterogeneity
“The size of the data is not the whole story.
If the data are uniform, they can almost always
be compressed and filtered with traditional
methods.
You do not get a ‘big data’ processing challenge
until other factors, such as variety, non-
uniformity and continuous growth, are added to
a large data set.”
(adapted from Aleksi Kallio)

Ideas on improving data integration
1. APIs to mitigate friction in data collection and preprocessing
2. Querying “by data set”
3. Leveraging advances in machine learning
So much public data out there!

APIs
Lowering barriers to entry with APIs (application programming interfaces: ways for a computer program to automatically retrieve information in a defined manner).
“80% of the time of a data scientist is spent finding and preparing the data”
APIs against good reference collections mitigate the hassle of looking for the right data sources, handling different versions/releases, etc.
We should be able to ask questions such as:
“Which gene variants in a patient have previously been associated with a specific disease?” <= addressed by SolveBio and Google Genomics (with the inclusion of the Tute annotation db)
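To make the variant question concrete, here is a minimal sketch of the lookup such an API would perform on the caller's behalf. It is not the SolveBio or Google Genomics interface: the in-memory `REFERENCE` table, the identifiers and the function name `associated_variants` are all hypothetical stand-ins for a curated collection behind a remote service.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    variant: str   # opaque variant identifier (made up for this sketch)
    gene: str
    disease: str

# Hypothetical stand-in for a curated reference collection behind an API.
REFERENCE = [
    Annotation("var_001", "GENE_A", "disease_x"),
    Annotation("var_002", "GENE_B", "disease_y"),
    Annotation("var_003", "GENE_A", "disease_x"),
]

def associated_variants(patient_variants, disease, reference=REFERENCE):
    """Return the patient's variants that the reference links to `disease`."""
    known = {a.variant for a in reference if a.disease == disease}
    return sorted(v for v in patient_variants if v in known)

patient = ["var_003", "var_999", "var_001"]
hits = associated_variants(patient, "disease_x")
print(hits)
```

The point of the API is that the caller never has to locate, download or version the reference table; only the question and the patient's variant list cross the interface.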

APIs
Other questions could be, e.g.:
“Which microorganisms are found in this tissue sample?” <= addressed by the One Codex API
“Which genes are expressed exclusively in the parathyroid gland?”
“What is the most similar expression dataset to this one that I am currently working on?” <= partly addressed by NextBio (but it’s a commercial package!)
“Download all available sequences for arthropoda and store them as FASTQ files” <= addressed by bionode.io
“Give me the publicly available RNA-seq sequences that support this peptide that I found in mass spectrometry and which appears to have been translated from a fusion transcript”

Data provenance
Researchers often want to look at processed data (avoiding the work of reprocessing everything from scratch) but they want to know how the processing was done.
• Each data set should have an “analysis history” attached
Also important for reproducibility and paper writing
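One way to picture an attached “analysis history” is a dataset object that records every processing step together with its parameters. The sketch below is only an illustration: the class names, the tool names and the fingerprint scheme are assumptions made for this example, not an existing standard.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str       # name of the program that was run (hypothetical here)
    version: str
    params: dict

@dataclass
class Dataset:
    name: str
    history: list = field(default_factory=list)

    def record(self, tool, version, **params):
        """Append one processing step to the analysis history."""
        self.history.append(Step(tool, version, params))
        return self

    def fingerprint(self):
        """Short hash of the full history: equal histories give equal hashes."""
        payload = json.dumps(
            [[s.tool, s.version, s.params] for s in self.history],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

ds = Dataset("muscle_rnaseq")
ds.record("aligner", "1.0", genome="hypothetical_build")
ds.record("counter", "0.3", min_quality=10)
for step in ds.history:
    print(step.tool, step.version, step.params)
print(ds.fingerprint())
```

Two processed files with the same fingerprint were produced the same way, which is the property a reader needs before reusing someone else's processed data.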

Querying by data set
Querying by dataset – we often want to relate our dataset to something “out
there” without necessarily having a good preconception what it could be.
(especially in metagenomics!)
NextBio does an interesting version of this but costs money (has been acquired by
Illumina) and focuses on selected types of functional studies.

Using the dataset itself, or a statistical description of it, as a query
Jeff Jonas: “Data finds data”, “The data is the query”
“We want to support automated data exploration in ways that are simply not possible today”
C. Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
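A minimal sketch of “the data is the query”: summarize each dataset as a numeric profile and rank a collection by similarity to the profile of the dataset at hand. Cosine similarity and the toy four-value profiles are choices made for this illustration; neither the measure nor the dataset names come from NextBio or any other system mentioned here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, collection):
    """Rank datasets in `collection` (name -> profile) by similarity to `query`."""
    scores = {name: cosine(query, profile) for name, profile in collection.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy profiles, e.g. average expression of four genes per dataset (made up).
collection = {
    "dataset_a": [5.0, 0.1, 3.0, 0.0],
    "dataset_b": [0.2, 4.0, 0.1, 6.0],
    "dataset_c": [4.0, 0.5, 2.5, 0.3],
}
query = [4.8, 0.2, 2.9, 0.1]
ranking = rank_by_similarity(query, collection)
for name, score in ranking:
    print(name, round(score, 3))
```

No keyword or annotation is needed: the statistical description of the new dataset alone determines which published datasets come back first.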

Cumulative biology and metagenomics: The unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”
“The unknown continent”
According to one estimate, less than 1% of the viral diversity has been explored!
=> Reference databases are very limited!

The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
• 80% of the 398 billion sequences they obtained could not be assembled into putative genes
• Where sequences could be assembled into putative genes encoding putative proteins, 60% of these proteins could not be matched to anything in the databases!

Ergo…
For metagenomics in particular, but also for other applications, we would like to have everything that has been published indexed in a better way, so that we can relate new results to it. We need a constantly growing index.
When we perform a new experiment, we could then relate our results to all of the data out there, not just the part that has made it into the official reference databases.
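As a toy illustration of a constantly growing index, the sketch below keeps an inverted index from k-mers to the studies that contain them, so that a new sequence can be related to everything indexed so far. The class `KmerIndex`, the study names and the tiny sequences are invented for this example; a real index at this scale would need far more compact data structures.

```python
from collections import defaultdict

class KmerIndex:
    """Inverted index from k-mers to the names of datasets containing them."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def kmers(self, seq):
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def add(self, name, sequences):
        """Index a new dataset; earlier entries are left untouched."""
        for seq in sequences:
            for kmer in self.kmers(seq):
                self.postings[kmer].add(name)

    def query(self, seq):
        """Fraction of the query's k-mers found in each indexed dataset."""
        query_kmers = self.kmers(seq)
        counts = defaultdict(int)
        for kmer in query_kmers:
            for name in self.postings.get(kmer, ()):
                counts[name] += 1
        return {name: n / len(query_kmers) for name, n in counts.items()}

index = KmerIndex(k=4)
index.add("soil_study", ["ACGTACGTGG", "TTGACCA"])
index.add("gut_study", ["GGGCCCAAAT"])
before = index.query("ACGTACGA")
print(before)

# The index grows as new studies are published; the same query now finds more.
index.add("new_study", ["ACGA"])
after = index.query("ACGTACGA")
print(after)
```

The same query returns more matches after `new_study` is added, without any curated reference database having been updated in between.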

Machine learning
Google has had great success with deep learning …
Learning to recognize cats from
unlabeled YouTube videos (2012)
Neural network with “3 million
neurons and 1 billion synapses”
…now it’s all over the place
Inaugural Stockholm deep learning meetup,
March 10, 2015

Deep learning
Perhaps deep learning could be used in genomics, proteomics etc to
transform diverse data sets into a more general representation which would
facilitate data integration?
New datasets can then be overlaid onto representations trained on large
collections.

Deep learning in genomics (1)
How do gene expression patterns relate to cell type and state? Classifying expression profiles into cell types is a hard problem because cell types really form a hierarchy, where different genes are important at different levels of the hierarchy.
We may be starting to accumulate enough data to enable a deep learning approach to learn a hierarchical representation of cell state based on expression profiles (particularly with all the single-cell RNA-seq data now coming out).

First step: Casey Greene’s group (Dartmouth)
A denoising autoencoder learned a generalized
representation of breast cancer expression
profiles based on the METABRIC cohort (>2000
samples). Validated on TCGA.
The nodes in the net can be interpreted to stand
for different biological features.
Tan et al. (2015)
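For readers unfamiliar with the technique, here is a toy denoising autoencoder in NumPy. It is a sketch of the general idea only, not the model of Tan et al.: the data are two invented “expression programs” over 8 genes, and the tied weights, masking rate, learning rate and number of hidden nodes are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "expression profiles": copies of two made-up programs over 8 genes.
prototypes = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                       [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
X = prototypes[rng.integers(0, 2, size=60)]
n, d = X.shape
k = 2                                    # number of hidden nodes

W = rng.normal(scale=0.5, size=(d, k))   # tied weights: the decoder uses W.T
b = np.zeros(k)
c = np.zeros(d)

def reconstruct(inputs):
    h = sigmoid(inputs @ W + b)          # hidden representation
    return h, sigmoid(h @ W.T + c)       # reconstruction

def loss(inputs, targets):
    return float(np.mean((reconstruct(inputs)[1] - targets) ** 2))

initial_loss = loss(X, X)
lr = 0.5
for epoch in range(2000):
    noisy = X * (rng.random(X.shape) > 0.25)     # masking noise: drop ~25% of inputs
    h, x_hat = reconstruct(noisy)
    d_out = (x_hat - X) * x_hat * (1 - x_hat)    # error against the CLEAN input
    d_hid = (d_out @ W) * h * (1 - h)
    W -= lr * (noisy.T @ d_hid + d_out.T @ h) / n
    b -= lr * d_hid.sum(axis=0) / n
    c -= lr * d_out.sum(axis=0) / n
final_loss = loss(X, X)
print(initial_loss, final_loss)
```

The network is trained to rebuild the clean profile from a corrupted one, which pushes the hidden nodes towards features shared across samples. A new sample can be passed through `reconstruct` to obtain its hidden representation.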

Convolutional network for splice site detection
Reads the DNA sequence directly and abstracts into higher-level features.
This network learned patterns of splice sites
And also re-discovered the concept of codons
Hannes Bretschneider: http://www.psi.toronto.edu/~hannes/resources/MLCB2014-Presentation.pdf
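“Reading the DNA sequence directly” usually means one-hot encoding each base and sliding small filters along the sequence. The sketch below shows that first layer with a single hand-set filter for the dinucleotide GT, with which introns typically begin. In a trained network the filter weights are learned rather than set by hand, and nothing here reproduces the network in the linked presentation.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d(encoded, kernel):
    """Slide `kernel` (width x 4 weights) along the sequence: one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        scores.append(sum(w * x
                          for kernel_row, input_row in zip(kernel, window)
                          for w, x in zip(kernel_row, input_row)))
    return scores

# Hand-set filter that responds maximally to "GT" (illustrative, not learned).
gt_filter = [one_hot("G")[0], one_hot("T")[0]]

seq = "CAGGTAAGT"
activations = conv1d(one_hot(seq), gt_filter)
hits = [i for i, a in enumerate(activations) if a == 2.0]
print(activations)
print(hits)
```

Stacking further layers on top of such activations is what lets a network abstract from raw bases to higher-level features.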

“Classical” machine learning
Predictive modeling as a way to integrate information from different experimental assays.
Example: ongoing mouse neural development project
A number of genome-wide experiments have been done in developing spinal cord and
cortex; have measurements/genome-wide signals about:
- Gene expression (RNA-seq)
- Where the Sox2 transcription factor is bound in each tissue (ChIP-seq)
- How open/accessible the chromatin is (DNase-seq)
- Potential transcription factor binding sites (DNase footprints)
as well as some calculated features like certain interesting “DNA words” (transcription
factor binding motifs) and how conserved each stretch of DNA is between mice and other
organisms.
How to make some sense of all these data?

“Genome browser” view of genomic landscape around a gene
[Figure: different data tracks around a gene: gene, conservation, “openness”, Sox2 binding (raw signal and peaks)]
(borrowed from Mark Gerstein)
We decided we are most interested in
understanding differences in gene
expression between spinal cord and
cortex neurons. Can the other
measurements help?
Progressively summarized and
abstracted the raw signals into blocks
with various features => matrix of
~20,000 genes x 13 features
Use machine learning techniques to
predict relative gene expression in
cortex/spinal cord based on these
features (ongoing…)
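The slide does not name a specific method, so the sketch below uses ridge regression purely as an example of predicting a per-gene value from a genes x features matrix. Everything in it is synthetic: the matrix is random, the “true” weights are invented, and 200 genes stand in for the real ~20,000.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_features = 200, 13            # the real matrix is ~20,000 x 13

# Synthetic stand-ins, not real biology.
X = rng.normal(size=(n_genes, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.1, size=n_genes)   # plays the role of relative expression

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Fit on the first 150 genes, evaluate on the held-out 50.
train, test = slice(0, 150), slice(150, None)
w = ridge_fit(X[train], y[train])
pred = X[test] @ w
r = np.corrcoef(pred, y[test])[0, 1]
print(round(float(r), 3))
```

On real data the held-out correlation would be far lower than on this synthetic example. The fitted weights are the interesting part, since they suggest which measurements carry information about expression differences.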

Recap
Indexing and querying technology such as Google’s can help genomics researchers by, e.g.:
- Enabling programmatic access to published data (processed but with a known analysis history) to lower the threshold for integrative analysis
- Allowing them to relate their datasets to other published data without overly relying on curated reference databases (cumulative biology)
- Facilitating ingestion into machine learning (e.g. deep learning) systems for learning general features of biological data from a very large set of samples
