Data analysis and integration
challenges in genomics
Uppsala
March 19, 2015
Mikael Huss, SciLifeLab / Stockholm University
Where I work
Integrative and technology-driven research in high-throughput biology
SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
• Inaugurated mid-2010
• Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.
• Approximately 700 researchers
• More than 100 researchers in bioinformatics and systems biology
http://ngi-status.scilifelab.se/
National genomics facilities at SciLifeLab
Clinical Genomics Clinical biomarkers Clinical sequencing
Functional genomics
Eukaryotic Single Cell Genomics
Single Cell Proteomics
Microbial Single Cell Genomics
Karolinska High Throughput Center
(KHTC)
Bioimaging - Advanced Light Microscopy, Fluorescence Correlation Spectroscopy
Drug discovery – ADME, Antibody Therapeutics, Protein Expression &
Characterization, Lead Identification, Biophysical Screening etc.
Chemical Biology Consortium Sweden – Umeå, Uppsala, KI
Structural Biology – Protein Science Facility
National facilities at SciLifeLab
Clinical diagnostics
Affinity proteomics
Biobank profiling, Cell profiling,
Fluorescence Tissue Profiling,
Mass Cytometry, PLA Proteomics,
Protein and Peptide Arrays,
Tissue Profiling
Bioinformatics facilities
• Bioinformatics compute and storage (UPPNEX)
• Short-term support (2 weeks / 80h) + paid extension
– About 45 FTEs
• Long-term support (500h) for projects selected by
external committee
“Embedded bioinformaticians”: participate in projects on a longer-term basis
Long-term bioinformatics support group
• Currently 13 senior bioinformaticians + 2 managers
• Currently recruiting for 10 new employees and
thereby expanding from Uppsala and Stockholm to
other locations in Sweden
• Example projects (from my own work):
– Characterizing the human muscle transcriptome in connection with exercise
– Metagenomics for looking at the connection between international travel and
antibiotic resistance
– Characterizing neural stem cells in developing mouse brain
– Small RNAs involved in the CRISPR/Cas9 system in bacteria
Integrative bioinformatics initiative
(“big data” project)
• Advertising for 4 positions, 2 in Gothenburg & 2 in
Stockholm
• More in-depth support, experimental planning,
method development
• Data integration
Pilot project
Connecting layers of information
DNA (whole-genome sequencing, exome sequencing, CGH)
• Mutations, SNVs
• Copy number variations
• Structural variations
• Gene fusions
RNA (RNA-seq, microarrays)
• mRNA isoforms
• Allele-specific expression
• Fusion transcripts
• eQTLs
Proteins (high-throughput mass spectrometry)
• Protein isoforms
• Post-translational modifications
Data analysis & integration challenges in genomics
My blog: Follow the Data
Machine learning, “big data”, “data science”, often in connection with life science
Published brief notes on APIs from One Codex, Google Genomics, SolveBio
Let’s get the “big data” buzzword out of the way …
… but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further” (OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data” (Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
Genomics big data in context: Throughput
[Figure: data processed per day in terabytes (log scale, 1e+00 to 1e+06). Entities shown: SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Twitter, Facebook, Baidu, NSA, Google, Ebay, Internet, World]
Genomics big data in context: Storage
[Figure: data stored in petabytes (log scale, 1 to 10000). Entities shown: AZ, SciLifeLab, Spotify, Sanger, Novartis, Ebay, Facebook, Baidu, NSA, Google]
Aside: Storage & processing frameworks
Hadoop, the standard solution for “big data” in industry, has not really caught on
in genomics … Why? Some ideas –
- Existing computing infrastructure is sufficient
- Or, focused on supercomputing solutions rather than commodity servers
- The programming/sysadmin skills and training are not there
- Many problems not parallelizable
- Not enough flexibility for ad hoc, exploratory analysis
Spark/ADAM, new framework enabling more interactive and in-memory-oriented analysis
Genomics big data in context: Heterogeneity
“The size of the data is not the whole story. If the data are uniform, they can almost always be compressed and filtered with traditional methods.
You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”
(adapted from Aleksi Kallio)
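A small illustration of the first half of this point, assuming “uniform” can be read narrowly as “highly regular”: with Python's standard zlib module, a repetitive byte string shrinks to a tiny fraction of its size, while an irregular one barely shrinks at all. Both inputs are synthetic stand-ins, not real sequencing data.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data)) / len(data)


# Highly regular input: the same four bytes repeated many times.
regular = b"ACGT" * 25_000

# Irregular input: pseudo-random bytes from a seeded generator (synthetic).
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(100_000))

print("regular:  ", compression_ratio(regular))
print("irregular:", compression_ratio(irregular))
```

Size alone does not separate the two inputs; their regularity does.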
Ideas on improving data integration
1. APIs to mitigate friction in data collection and preprocessing
2. Querying “by data set”
3. Leveraging advances in machine learning
So much public data out there!
APIs
Lowering barriers to entry with APIs (application programming interfaces; ways
for a computer program to automatically retrieve information in a defined
manner).
“80% of the time of a data scientist is spent finding and preparing the data”
APIs against good reference collections mitigate the hassle of looking for the right
data sources, handling different versions/releases, etc.
We should be able to ask questions such as:
“Which gene variants in a patient have been previously associated with a specific
disease?” <= addressed by SolveBio and Google Genomics (with the inclusion of
the Tute annotation db)
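To make the variant question concrete, here is a minimal sketch of the kind of call such an API could support. Everything in it is hypothetical: `ReferenceCollection` is a local stand-in for a versioned remote collection, not the SolveBio or Google Genomics client, and the variant and disease names are placeholders.

```python
class ReferenceCollection:
    """Local stand-in for a versioned reference collection behind an API (hypothetical)."""

    def __init__(self, release, associations):
        self.release = release
        # Maps a variant identifier to the set of diseases it has been linked to.
        self._associations = associations

    def diseases_for(self, variant_id):
        return self._associations.get(variant_id, set())


def variants_linked_to(patient_variants, disease, reference):
    """Return the patient's variants that the reference links to the given disease."""
    hits = [v for v in patient_variants if disease in reference.diseases_for(v)]
    # Reporting the release keeps track of which version of the reference answered.
    return {"release": reference.release, "disease": disease, "variants": hits}


# Placeholder data, for illustration only.
reference = ReferenceCollection(
    "toy-release-1",
    {"variant_A": {"disease_X"}, "variant_B": {"disease_Y"}},
)
print(variants_linked_to(["variant_A", "variant_C"], "disease_X", reference))
```

Returning the release alongside the hits is one way to handle the “different versions/releases” problem mentioned above.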
APIs
Other questions could be, e.g.:
“Which microorganisms are found in this tissue sample?” <= addressed by the One
Codex API
“Which genes are expressed exclusively in the parathyroid gland?”
“What is the most similar expression dataset to this one that I am currently
working on?” <= partly addressed by NextBio (but it’s a commercial package!)
“Download all available sequences for arthropoda and store them as FASTQ files”
<= addressed by bionode.io
“Give me the publicly available RNA-seq sequences that support this peptide that I
found in mass spectrometry and which appears to have been translated from a
fusion transcript”
Data provenance
Researchers often want to look at processed data (avoiding the work of
reprocessing everything from scratch) but they want to know how the processing
was done.
=> Each data set should have an “analysis history” attached
Also important for reproducibility and paper writing
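One way an attached “analysis history” could be represented is sketched below. The class names, fields and the tool names in the usage lines are assumptions made for illustration, not an existing standard or anything prescribed in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingStep:
    """One entry in the analysis history (hypothetical fields)."""
    tool: str
    version: str
    parameters: dict


@dataclass
class Dataset:
    name: str
    history: list = field(default_factory=list)

    def derive(self, new_name, step):
        """Return a new dataset whose history is this one's plus the given step."""
        return Dataset(new_name, self.history + [step])

    def describe(self):
        """Render the history as text, e.g. for a methods section."""
        return [f"{s.tool} {s.version} {s.parameters}" for s in self.history]


# Placeholder tool names, for illustration only.
raw = Dataset("raw_reads")
aligned = raw.derive("aligned", ProcessingStep("aligner_x", "1.0", {"mismatches": 2}))
counts = aligned.derive("gene_counts", ProcessingStep("counter_y", "0.3", {}))
print(counts.describe())
```

Because `derive` copies the history rather than mutating it, every processed data set carries its own complete record and the raw data set stays untouched.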
Querying by data set
Querying by dataset – we often want to relate our dataset to something “out
there” without necessarily having a clear preconception of what it could be
(especially in metagenomics!).
NextBio does an interesting version of this, but it costs money (it has been acquired by
Illumina) and focuses on selected types of functional studies.
Using the dataset itself, or a statistical
description of it, as a query
Jeff Jonas:
“Data finds data”
“The data is the query”
“we want to support automated data exploration in ways that are simply not possible today”
C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
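A minimal sketch of “the data is the query”, assuming expression profiles are stored as gene-to-value mappings and that Pearson correlation over shared genes is an acceptable similarity measure. Both are assumptions for illustration; the talk does not prescribe a measure, and the dataset names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def rank_by_similarity(query, collection):
    """Rank stored profiles by correlation with the query over shared genes."""
    scored = []
    for name, profile in collection.items():
        shared = sorted(set(query) & set(profile))
        if len(shared) < 2:
            continue  # too little overlap to compare
        r = pearson([query[g] for g in shared], [profile[g] for g in shared])
        scored.append((name, r))
    return sorted(scored, key=lambda item: -item[1])


# Synthetic profiles, for illustration only.
collection = {
    "study_1": {"g1": 2.0, "g2": 4.0, "g3": 6.0},
    "study_2": {"g1": 3.0, "g2": 2.0, "g3": 1.0},
}
print(rank_by_similarity({"g1": 1.0, "g2": 2.0, "g3": 3.0}, collection))
```

The query here is the dataset itself; no keyword or accession number is needed to find the closest match.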
Cumulative biology and metagenomics:
The unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”
“The unknown continent”
According to one estimate,
less than 1% of the viral
diversity has been explored!
=> Reference databases very limited!
The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
• 80% of the 398 billion sequences they obtained could not be assembled into putative genes
• Of the sequences that could be assembled into putative genes encoding putative proteins, 60% of those proteins could not be matched to anything in the databases!
Ergo…
For metagenomics in particular, but also for other applications, we would like
everything that has been published to be indexed in a better way, so that we can
relate new data to it. We need a constantly growing index.
When we perform a new experiment, we could then relate our results to all of
the data out there, not just the part that has made it into the official reference
databases.
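A constantly growing index could, in its simplest form, be an inverted index from k-mers to the data sets that contain them. The sketch below is an assumption about one possible design, with hypothetical class and dataset names and toy sequences; it is not a description of an existing service.

```python
from collections import defaultdict


class KmerIndex:
    """Inverted index from k-mers to dataset ids; grows as data sets are added."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def kmers(self, seq):
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def add(self, dataset_id, seq):
        for kmer in self.kmers(seq):
            self.postings[kmer].add(dataset_id)

    def query(self, seq):
        """Return (dataset id, fraction of query k-mers found), best match first."""
        query_kmers = self.kmers(seq)
        if not query_kmers:
            return []
        counts = defaultdict(int)
        for kmer in query_kmers:
            for dataset_id in self.postings.get(kmer, ()):
                counts[dataset_id] += 1
        scored = [(d, n / len(query_kmers)) for d, n in counts.items()]
        return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy sequences, for illustration only.
index = KmerIndex(k=4)
index.add("soil_A", "ACGTACGTAC")
index.add("gut_B", "TTTTGGGGCC")
print(index.query("ACGTAC"))
index.add("new_C", "GGGGCCAA")  # the index keeps growing as data are published
print(index.query("GGGGCC"))
```

A new data set becomes searchable as soon as it is added, without waiting for it to enter a curated reference database.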
Machine learning
Google has had great success with deep learning …
Learning to recognize cats from
unlabeled YouTube videos (2012)
Neural network with “3 million
neurons and 1 billion synapses”
…now it’s all over the place
Inaugural Stockholm deep learning meetup,
March 10, 2015
Deep learning
Perhaps deep learning could be used in genomics, proteomics, etc. to
transform diverse data sets into a more general representation, which would
facilitate data integration?
New datasets can then be overlaid onto representations trained on large
collections.
Deep learning in genomics (1)
How do gene expression patterns relate to cell type and state? Classifying
expression profiles into cell types is hard because cell identity is really a hierarchy in which
different genes are important at different levels
We may be starting to accumulate enough data to enable a deep learning
approach to learn a hierarchical representation of cell state based on expression
profiles (particularly with all the single-cell RNA-seq data now coming out)
First step: Casey Greene’s group (Dartmouth)
A denoising autoencoder learned a generalized
representation of breast cancer expression
profiles based on the METABRIC cohort (>2000
samples). Validated on TCGA.
The nodes in the net can be interpreted to stand
for different biological features.
Tan et al. (2015)
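For readers unfamiliar with the technique, a denoising autoencoder can be sketched in a few lines of plain Python: inputs are randomly corrupted, and the network is trained to reconstruct the uncorrupted version through a narrow hidden layer. This toy version (one hidden layer, sigmoid units, squared error, synthetic binary “profiles”) is an assumption-laden illustration and not the architecture or data used by Tan et al.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DenoisingAutoencoder:
    def __init__(self, n_in, n_hidden, seed=0):
        self.rng = random.Random(seed)
        u = self.rng.uniform
        self.w1 = [[u(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[u(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def decode(self, h):
        return [sigmoid(sum(w * v for w, v in zip(row, h)) + b)
                for row, b in zip(self.w2, self.b2)]

    def train_step(self, x, lr, noise):
        # Corrupt the input by zeroing each value with probability `noise`.
        noisy = [0.0 if self.rng.random() < noise else v for v in x]
        h = self.encode(noisy)
        y = self.decode(h)
        # Backpropagate squared error measured against the clean input.
        dy = [(yj - xj) * yj * (1 - yj) for yj, xj in zip(y, x)]
        dh = [sum(dy[j] * self.w2[j][k] for j in range(len(dy)))
              for k in range(len(h))]
        dz = [dh[k] * h[k] * (1 - h[k]) for k in range(len(h))]
        for j, row in enumerate(self.w2):
            for k in range(len(h)):
                row[k] -= lr * dy[j] * h[k]
            self.b2[j] -= lr * dy[j]
        for k, row in enumerate(self.w1):
            for i in range(len(noisy)):
                row[i] -= lr * dz[k] * noisy[i]
            self.b1[k] -= lr * dz[k]

    def reconstruction_error(self, data):
        """Mean squared reconstruction error per sample on uncorrupted inputs."""
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yj - xj) ** 2 for yj, xj in zip(y, x))
        return total / len(data)


# Synthetic data: two made-up "expression programs", four samples each.
data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0]] * 4 + [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]] * 4
model = DenoisingAutoencoder(n_in=6, n_hidden=2)
before = model.reconstruction_error(data)
for _ in range(300):
    for sample in data:
        model.train_step(sample, lr=0.5, noise=0.2)
after = model.reconstruction_error(data)
print(before, after)
```

The hidden units play the role of the “nodes in the net” mentioned above: after training, each one summarizes a pattern shared across many input features.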
Deep learning in genomics (2)
Convolutional network for splice site detection
Reads the DNA sequence directly and abstracts into higher-level features.
This network learned patterns of splice sites
And also re-discovered the concept of codons
Hannes Bretschneider: http://www.psi.toronto.edu/~hannes/resources/MLCB2014-Presentation.pdf
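What “reading the DNA sequence directly” means in practice can be sketched as follows: the sequence is one-hot encoded and a filter is slid along it, which is the operation a first convolutional layer performs. In a trained network the filter weights are learned; here a single filter is set by hand to fire on “GT”, the dinucleotide found at the start of most introns, purely for illustration. The function names and the example sequence are assumptions, not taken from the cited presentation.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors, one per base."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel):
    """Slide a (width x 4) kernel along the encoded sequence; no padding."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(score)
    return activations


# Hand-set filter: first position matches G, second position matches T.
GT_FILTER = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

sequence = "CAGGTAAGT"  # toy sequence
activations = conv1d(one_hot(sequence), GT_FILTER)
hits = [i for i, a in enumerate(activations) if a == 2.0]
print(hits)
```

Stacking further layers on top of such filter outputs is what lets a network abstract from single motifs to higher-level features.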
“Classical” machine learning
Predictive modeling as a way to integrate information from different experimental assays.
Example: ongoing mouse neural development project
A number of genome-wide experiments have been done in developing spinal cord and
cortex; we have genome-wide measurements of:
- Gene expression (RNA-seq)
- Where the Sox2 transcription factor is bound in each tissue (ChIP-seq)
- How open/accessible the chromatin is (DNase-seq)
- Potential transcription factor binding sites (DNase footprints)
as well as some calculated features like certain interesting “DNA words” (transcription
factor binding motifs) and how conserved each stretch of DNA is between mice and other
organisms.
How to make some sense of all these data?
“Genome browser” view of genomic landscape around a gene
Gene
Conservation
Different data tracks
“Openness”
Sox2
binding
raw signal
peaks
(borrowed from Mark Gerstein)
We decided we are most interested in
understanding differences in gene
expression between spinal cord and
cortex neurons. Can the other
measurements help?
Progressively summarized and
abstracted the raw signals into blocks
with various features => matrix of
~20,000 genes x 13 features
Use machine learning techniques to
predict relative gene expression in
cortex/spinal cord based on these
features (ongoing…)
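The summarization step could look roughly like the sketch below: per-base signal tracks and peak calls are reduced to one feature row per gene. The feature names, the choice of mean/max/peak count, and the toy coordinates are all assumptions for illustration; the 13 features actually used in the project are not listed here.

```python
from statistics import mean


def summarize_gene(start, end, tracks, peaks):
    """Build one feature row for the half-open interval [start, end).

    tracks: name -> per-base signal list (e.g. chromatin openness)
    peaks:  name -> list of (start, end) peak intervals (e.g. Sox2 binding)
    """
    row = {}
    for name, signal in tracks.items():
        window = signal[start:end]
        row[f"{name}_mean"] = mean(window)
        row[f"{name}_max"] = max(window)
    for name, intervals in peaks.items():
        # Count peaks that overlap the gene interval at all.
        row[f"{name}_peaks"] = sum(1 for s, e in intervals if s < end and e > start)
    return row


def feature_matrix(genes, tracks, peaks):
    """genes: name -> (start, end). Returns name -> feature row."""
    return {g: summarize_gene(s, e, tracks, peaks) for g, (s, e) in genes.items()}


# Toy coordinates and signals, for illustration only.
tracks = {"openness": [0, 0, 2, 4, 0, 0]}
peaks = {"sox2": [(1, 3), (5, 6)]}
print(feature_matrix({"gene_1": (2, 4)}, tracks, peaks))
```

Each row of the resulting matrix can then be paired with the measured expression difference for that gene and passed to a standard predictive model.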
Recap
Indexing and querying technology such as Google’s can help genomics researchers by, e.g.:
- Enabling programmatic access to published data (processed but with a known
analysis history) to lower the threshold for integrative analysis
- Allowing them to relate their datasets to other published data without overly
relying on curated reference databases (cumulative biology)
- Facilitating ingestion into machine learning (e.g. deep learning) systems for learning
general features of biological data from a very large set of samples
Data analysis & integration challenges in genomics
Extra slides

