Introduction to
machine learning
in genomics
BRIAN SCHILDER
BIOINFORMATICIAN II
RAJ LAB 08/21/2020
[1] Nash Family Department of Neuroscience & Friedman Brain Institute
[2] Ronald M. Loeb Center for Alzheimer’s Disease
[3] Department of Genetics and Genomic Sciences & Icahn Institute for Data Science and Genomic Technology
[4] Estelle and Daniel Maggin Department of Neurology
Approaches to making predictions
L Breiman, Statistical modeling: The two
cultures. Statistical Science. 16, 199–215 (2001).
Explicit modeling
(your brain learns x~y)
“I will predict y from x by
assuming relationships based on
my knowledge/the literature.”
Pros
Can utilize the
prevailing wisdom.
Highly interpretable
models.
Cons
Susceptible to bias/
assumptions/
arbitrary parameters.
May not explain the
variance very well.
Machine learning
(your computer learns x~y)
“I will predict y from x by having
an algorithm learn the
relationships from data.”
Pros
Less susceptible to
(some forms of)
human bias.
Can make
predictions from
complex/multivariate
data.
Cons
Can be less
interpretable.
May not generalize
to other data.
• What’s the relationship
between x and y?
• If you do something to x,
what will happen to y?
Science in a nutshell
cells
What is machine learning?
Artificial
Intelligence
The automation of
tasks that normally
require human
intelligence.
Machine
Learning
Automated
optimization of
some function by
learning directly
from data (as
opposed to
following explicit
rules).
Example of explicit rules (decision tree):
x > 4?
  If True: y + z < 2?
    If True = Go to Dr.
    If False = Go to ER
  If False: Stay home
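To make the contrast between explicit rules and learning from data concrete, here is a minimal Python sketch. The first function transcribes the decision tree from the slide; the second is an illustrative assumption (not from the talk) showing how the cut-off on x could be picked from labelled examples instead of being hard-coded. The toy data and function names are hypothetical.

```python
# Explicit rules: a direct transcription of the decision tree on the slide.
def triage_by_rules(x, y, z):
    if x > 4:
        if y + z < 2:
            return "Go to Dr."
        return "Go to ER"
    return "Stay home"


# Learning from data: choose the cut-off t for the rule `x > t` that
# classifies the labelled examples most accurately.
def learn_threshold(xs, labels):
    """Return (best cut-off, accuracy) over all candidate cut-offs in xs."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        correct = sum((x > t) == label for x, label in zip(xs, labels))
        acc = correct / len(xs)
        if acc > best_acc:  # keep the first cut-off that reaches the best accuracy
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical toy data: label is True when the patient needed care.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_needs_care = [False, False, False, False, True, True, True, True]
learned_t, learned_acc = learn_threshold(toy_x, toy_needs_care)
print(learned_t, learned_acc)
```

On this toy data the learned cut-off coincides with the hand-written one, which is the point: the rule is recovered from examples rather than specified in advance.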
General ML framework
Input
training
data
Output
predictions
Evaluate
accuracy
against real
answer
Adjust
model
1. Training phase
2. Testing phase
Output predictions
Input testing data
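The training/testing loop above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the model is a one-variable linear fit, the "adjust model" step is gradient descent on mean squared error, and the data, learning rate and epoch count are made up for the example.

```python
import random


def train_linear_model(xs, ys, lr=0.01, epochs=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # 1. Output predictions from the current model.
        preds = [w * x + b for x in xs]
        # 2. Evaluate against the real answers.
        errors = [p - y for p, y in zip(preds, ys)]
        # 3. Adjust the model in the direction that reduces the error.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mean_squared_error(w, b, xs, ys):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y = 3x + 1 plus a little noise.
rng = random.Random(0)
data_x = [i / 10 for i in range(50)]
data_y = [3.0 * x + 1.0 + rng.gauss(0, 0.1) for x in data_x]

# Hold out 10 points that the model never sees during training.
pairs = list(zip(data_x, data_y))
rng.shuffle(pairs)
train_pairs, test_pairs = pairs[:40], pairs[40:]
train_x, train_y = zip(*train_pairs)
test_x, test_y = zip(*test_pairs)

# Training phase, then testing phase on the held-out data.
w, b = train_linear_model(train_x, train_y)
test_mse = mean_squared_error(w, b, test_x, test_y)
print(w, b, test_mse)
```

The held-out error, not the training error, is the number to report, since it estimates how the model behaves on data it was not fitted to.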
Supervised learning example
Input data
• Categorical
• Continuous
MODEL
• Logistic
regression
• Linear
regression
• GLMM
• SVM
• Neural
network
• Genetic
algorithm
• etc...
Output
prediction
• Categorical
• Continuous
Dog
(.04)
Cat
(.96)
Transform data
(or a gene
expression
vector…)
Correct!
+1
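As a rough sketch of the last step in the supervised example, the snippet below turns raw model scores into class probabilities with a softmax and scores the prediction against the true label. The raw scores are hypothetical values chosen so that the output matches the Dog (.04) / Cat (.96) numbers on the slide; a real classifier would compute them from the input.

```python
import math


def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Dog", "Cat"]
raw_scores = [0.0, math.log(24)]  # hypothetical scores, picked to reproduce .04 / .96
probs = softmax(raw_scores)

predicted = labels[probs.index(max(probs))]
true_label = "Cat"
score = 1 if predicted == true_label else 0  # "Correct! +1"
print(dict(zip(labels, probs)), predicted, score)
```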
ML vs. statistics: an increasingly blurry line
◦ Math and statistics were developed well before the advent of computers.
◦ Modern computers enable rapid iterative processes (optimization, distribution simulation)
◦ Linear regression, PCA and t-SNE are all technically AI/ML, but we often don’t think of them that way anymore.
https://towardsdatascience.com/introduction-to-linear-regression-and-polynomial-regression-f8adc96f31cb
https://ai.googleblog.com/2018/06/realtime-tsne-visualizations-with.html
Linear regression PCA t-SNE
Ways we use ML in biology
Regression
• DGE
• GWAS
• LDScore
• Batch correction
• …
Dimensionality Reduction
• PCA
• MDS
• t-SNE
• UMAP
• Manifold learning
• Autoencoders
• …
Clustering
• Centroid-based
• K-means
• Hierarchical
• Agglomerative
• Density-based
• DBSCAN
• Louvain
• Distribution-based
• Expectation-maximization
Networks
• Co-expression
• Multi-omics
• Causality
• Temporal graphs
• Imputation
• …
…and much more!
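As one concrete instance from the clustering family listed above, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The expression values and starting centroids are hypothetical, and real analyses would normally use a library implementation on many dimensions.

```python
def kmeans_1d(values, centroids, n_iter=10):
    """Lloyd's algorithm on 1-D data with user-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(members) / len(members) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical expression values for one gene across six cells.
expression = [0.9, 1.1, 1.0, 5.2, 4.8, 5.0]
final_centroids, final_clusters = kmeans_1d(expression, centroids=[0.0, 6.0])
print(final_centroids, final_clusters)
```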
Which ML model do I use?
Complex relationships?
• No: Simpler model
• Yes: Need high interpretability?
  ◦ Yes: Simpler model
  ◦ No: Lots of data?
    ◦ Yes: More complex model
    ◦ No: Simpler model
https://towardsdatascience.com/the-balance-accuracy-vs-interpretability-1b3861408062
In practice, you try multiple models of
varying complexity and compare
performances.
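The flowchart on this slide can be written down as a small helper function. This is only a transcription of the three questions into code; the function and argument names are hypothetical, and as noted above the output is a starting point rather than a substitute for comparing several models empirically.

```python
def suggest_model_complexity(complex_relationships, need_interpretability, lots_of_data):
    """Walk the three questions of the flowchart and return its suggestion."""
    if not complex_relationships:
        return "simpler model"
    if need_interpretability:
        return "simpler model"
    if lots_of_data:
        return "more complex model"
    return "simpler model"


# Example: complex relationships, no hard interpretability requirement, large dataset.
suggestion = suggest_model_complexity(True, False, True)
print(suggestion)
```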
Deep learning in genomics
What is deep learning?
Eraslan et al. 2019, Nature Reviews Genetics
Single task: given their sequences (input),
how probable is it that these
regions are binding motifs for
TF A (output)?
Multi-task: given their sequences (input),
how probable is it that these
regions are binding motifs for
TF A (output 1) or TF B
(output 2)?
Multi-modal: given their sequences (input 1)
and chromatin accessibility
profiles (input 2), how
probable is it that these regions
are binding motifs for TF A
(output)?
[Figure: example network]
Layer 1 (input): node 1-1, node 1-2
Layer 2 (hidden): node 2-1, node 2-2, node 2-3
Layer 3 (output): node 3-1, node 3-2
Pros
• Extremely flexible framework.
• Highly parallelizable (GPUs).
• Able to learn complex, non-linear
features.
Cons
• Challenging to interpret.
• Can require lots of compute.
• [Usually] requires lots of data.
https://www.cybercontrols.org/neuralnetworks
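For readers who want to see what the network diagram computes, here is a minimal forward pass through a network of the same shape (2 input nodes, 3 hidden nodes, 2 output nodes). The weights are hand-picked, hypothetical values and the sigmoid activation is an assumption; in practice the weights are learned during training and a framework such as PyTorch or TensorFlow would be used.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hypothetical weights: one row per node, one column per incoming connection.
hidden_weights = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]  # 3 hidden nodes x 2 inputs
hidden_biases = [0.0, 0.0, 0.0]
output_weights = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]    # 2 output nodes x 3 hidden nodes
output_biases = [0.0, 0.0]


def forward(inputs):
    """Layer 1 (input) -> Layer 2 (hidden) -> Layer 3 (output)."""
    hidden = dense_layer(inputs, hidden_weights, hidden_biases)
    return dense_layer(hidden, output_weights, output_biases)


outputs = forward([1.0, 0.0])  # two outputs, e.g. one score each for TF A and TF B
print(outputs)
```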
Deep learning in genomics
So what exactly can you do
with deep learning?
Predict [x] from DNA sequence
Disease risk
Gene expression
Splicing
TF motifs
Epigenomic impact
DNA sequence
•primateAI
•Deep Structured Phenotype Network (DSPN)
•xpresso
•spliceAI
•Equivariant networks
•DeepSEA
•DeepFIGV
•Avocado
• In many cases, deep learning
models performed far better
than other approaches
(e.g. heuristics, SVM)
• That said, rigorous testing on
datasets sufficiently different
from those used in training is
key (but often difficult)
Why are ANNs so useful for sequences?
◦ DNA sequences are really hard for
humans to understand.
◦ Especially true when considering
long sequences, or multi-scale
non-linear interactions.
◦ Artificial neural networks (ANN)
excel at complex feature learning (e.g.
image recognition).
◦ CNNs are great for learning
hierarchical features
◦ nostril < nose < face < cat
◦ Humans can then interrogate and
interpret these features.
Eraslan et al. 2019, Nature Reviews Genetics
[Figure: Encoded input → ANN model → Prediction]
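The "encoded input" step and the idea of a convolutional filter can be sketched as follows. DNA is one-hot encoded, and a single filter is slid along the sequence to give one score per position. Here the filter is simply the one-hot encoding of a hypothetical motif ("TATA"); in a trained CNN the filter weights would be learned from data rather than written by hand.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element 0/1 vectors (A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def scan_filter(encoded, motif_filter):
    """Slide a filter along the encoded sequence; return one score per position."""
    width = len(motif_filter)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            w * x
            for filter_row, base_row in zip(motif_filter, window)
            for w, x in zip(filter_row, base_row)
        )
        scores.append(score)
    return scores


# Hypothetical filter that responds most strongly to the motif "TATA".
tata_filter = one_hot("TATA")
scores = scan_filter(one_hot("GGTATAGG"), tata_filter)
best_position = scores.index(max(scores))
print(scores, best_position)
```

Stacking several such layers is what lets a CNN build up hierarchical features, with later filters responding to combinations of the motifs detected by earlier ones.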
Other data types
Denoise noisy data (e.g. scRNA-seq)
Deep count autoencoder (DCA)
◦ Eraslan et al. (2019), Nature Communications
Learning with AuToEncoder
◦ Badsha et al. (2020), Quantitative Biology
SAVER-X
◦ Wang et al. (preprint) bioRxiv
Dimensionality reduction
~ 70k PBMCs
Transcriptomes
Realistic inference
Latent space interpolation:
◦ [Conditional] variational autoencoders (VAE)
◦ [style transfer] Generative adversarial networks (GAN)
(Gómez-Bombarelli, et al. 2018)
scGen (Lotfollahi, Wolf, & Theis, 2019); (Pieters & Wiering, 2018, bioRxiv)
Stimulated (e.g. IFN-β)
peripheral blood mononuclear cell (PBMC):
e.g. T/B/NK cells, monocytes
Drugs
Faces
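Latent space interpolation can be illustrated with simple vector arithmetic. The sketch below assumes that cells have already been mapped to latent vectors by some trained encoder (not shown), estimates a "stimulation" direction as the difference between group means, and applies it to an unseen control cell. All vectors are hypothetical 2-D values; a real model would use many more dimensions and would decode the shifted vector back to expression space.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def interpolate(z_start, z_end, alpha):
    """Point a fraction `alpha` of the way from z_start to z_end."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]


# Hypothetical latent codes for control and stimulated cells.
control_latents = [[0.0, 1.0], [2.0, 1.0]]
stimulated_latents = [[1.0, 3.0], [3.0, 3.0]]

# Direction in latent space associated with stimulation.
delta = [
    s - c
    for s, c in zip(mean_vector(stimulated_latents), mean_vector(control_latents))
]

# Shift an unseen control cell along that direction.
unseen_control = [4.0, 0.0]
predicted_stimulated = [z + d for z, d in zip(unseen_control, delta)]

# Intermediate state halfway between control and predicted stimulated.
halfway = interpolate(unseen_control, predicted_stimulated, 0.5)
print(delta, predicted_stimulated, halfway)
```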
Disease prediction
“…we developed an interpretable deep-learning framework, the
Deep Structured Phenotype Network (DSPN) (21). This model
combines a Deep Boltzmann Machine architecture with
conditional and lateral connections derived from the regulatory
network (50).”
Improvement over baseline (50%)
• Logistic predictor: 2.4-fold
• DSPN: 6-fold
• Captures non-linear interactions
When does deep learning fail?
When there’s not enough
training/testing data.
Can contribute to
overfitting; model
can’t translate to
other datasets.
When the data hasn’t been
preprocessed properly, or
has some other
uncorrected confound.
e.g. White label on
bottom of image
from disease-specialty
hospital.
When a high degree of
interpretability and
explainability are required.
e.g. Clinical
decision support.
When a simpler model can
do just as well for less
compute.
Always compare
performance to
other methods.
When you’re asking the
wrong question, or the
fitness function is
misspecified.
Requires domain
knowledge.
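The first and fourth failure modes (overfitting, and a simpler model doing just as well) can be seen in a deliberately contrived toy example. A "memorising" model (1-nearest-neighbour) scores perfectly on noisy training data but poorly on held-out data, while a one-parameter threshold rule generalises. The data, the noise, and the fixed cut-off of 5 are all hypothetical and chosen to make the effect obvious.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Memorising model: copy the label of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def threshold_predict(x, cutoff=5):
    """Simple model: a single cut-off (assumed here rather than fitted)."""
    return 1 if x > cutoff else 0


def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# True rule is "label = 1 when x > 5"; training labels at x=3 and x=8 are noisy.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 1, 0, 1]
test_x = [2.9, 3.1, 7.9, 8.1, 1.5, 6.5]
test_y = [0, 0, 1, 1, 0, 1]

memoriser_train_acc = accuracy(
    [nearest_neighbour_predict(train_x, train_y, x) for x in train_x], train_y
)
memoriser_test_acc = accuracy(
    [nearest_neighbour_predict(train_x, train_y, x) for x in test_x], test_y
)
simple_train_acc = accuracy([threshold_predict(x) for x in train_x], train_y)
simple_test_acc = accuracy([threshold_predict(x) for x in test_x], test_y)
print(memoriser_train_acc, memoriser_test_acc, simple_train_acc, simple_test_acc)
```

The memorising model looks better if only training accuracy is inspected, which is why performance should always be compared on held-out data and against simpler baselines.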
Deep learning references
Reviews
◦ J Zou et al., A primer on deep learning in genomics. Nature Genetics
(2018), doi:10.1038/s41588-018-0295-5.
◦ G Eraslan et al., Deep learning: new computational modelling
techniques for genomics. Nature Reviews Genetics (2019),
doi:10.1038/s41576-019-0122-6.
◦ TJ Cleophas et al., Machine Learning in Medicine. Circulation. 132,
1920–1930 (2015).
◦ P Baldi, Deep Learning in Biomedical Data Science. Annual Review
of Biomedical Data Science. 1, 181–205 (2018).
◦ R Miotto et al., Deep learning for healthcare: review, opportunities
and challenges. Briefings in Bioinformatics. 19, 1236–1246 (2017).
◦ VI Jurtz et al., An introduction to deep learning on biological
sequence data: Examples and solutions. Bioinformatics. 33, 3685–
3690 (2017).
◦ MKK Leung et al., Machine Learning in Genomic Medicine: A
Review of Computational Problems and Data Sets. Proceedings of the
IEEE. 104, 176–197 (2016).
◦ DSW Ho et al., Machine learning SNP based prediction for
precision medicine. Frontiers in Genetics. 10, 1–10 (2019).
◦ A Taylor-Weiner et al., Scaling computational genomics to millions
of individuals with GPUs. Genome Biology. 20, 1–5 (2019).
◦ L Breiman, Statistical modeling: The two cultures. Statistical Science.
16, 199–215 (2001).
◦ BS Ullman, Using neuroscience to develop artificial intelligence.
Science. 363, 692–694 (2019).
◦ A Marblestone et al., Towards an integration of deep learning and
neuroscience. Frontiers in Computational Neuroscience. 10, 1–41 (2016).
◦ Y Bengio et al., Towards Biologically Plausible Deep Learning
(2015), doi:10.1007/s13398-014-0173-7.2.
◦ KM Chen et al., Selene: a PyTorch-based deep learning library for
sequence-level data. Nature Methods. 16, 315–318 (2019).
Genomics
◦ Disease risk
◦ D Wang et al., Comprehensive functional genomic resource and integrative model for the
adult brain. Science, 1266 (2018).
◦ L Sundaram et al., Predicting the clinical impact of human mutation with deep neural
networks. Nature Genetics. 50, 1161–1170 (2018).
◦ Y Ding et al., A deep learning model to predict a diagnosis of Alzheimer disease by using
18 F-FDG PET of the brain. Radiology. 290, 456–464 (2019).
◦ I Klyuzhin et al., Use of deep convolutional neural networks to predict Parkinson’s disease
progression from DaTscan SPECT images. Journal of Nuclear Medicine. 59, 29 (2018).
◦ KK Dey et al., Evaluating the informativeness of deep learning annotations for human
complex diseases. bioRxiv, 784439 (2019).
◦ A Romagnoni et al., Comparative performances of machine learning methods for
classifying Crohn Disease patients using genome-wide genotyping data. Scientific Reports. 9,
1–18 (2019).
◦ CAC Montañez et al., Deep Learning Classification of Polygenic Obesity using Genome
Wide Association Study SNPs. Proceedings of the International Joint Conference on Neural
Networks. 2018-July (2018), doi:10.1109/IJCNN.2018.8489048.
◦ Gene expression
◦ V Agarwal et al., Predicting mRNA Abundance Directly from Genomic Sequence Using
Deep Convolutional Neural Networks. Cell Reports. 31, 107663
(2020).
◦ X Li et al., The impact of rare variation on gene expression across tissues. Nature. 550,
239–243 (2017).
◦ JD Washburn et al., Evolutionarily informed deep learning methods for predicting relative
transcript abundance from DNA sequence. Proceedings of the National Academy of Sciences of
the United States of America. 116, 5542–5549 (2019).
◦ Y Zhang et al., Predicting Gene Expression from DNA Sequence using Residual Neural
Network. bioRxiv, in press, doi:10.1101/2020.06.21.163956.
◦ Epigenomics
◦ J Zhou et al., Predicting effects of noncoding variants with deep learning-based sequence
model. Nature Methods. 12, 931–934 (2015).
◦ GE Hoffman et al., Functional Interpretation of Genetic Variants Using Deep Learning
Predicts Impact on Epigenome. Nucleic Acids Research, 1–15 (2019).
◦ J Schreiber et al., Avocado: a multi-scale deep tensor factorization method learns a latent
representation of the human epigenome. Genome Biology. 21, 364976 (2018).
◦ Splicing
◦ K Jaganathan et al., Predicting Splicing from Primary Sequence with Deep Learning. Cell.
176, 535-548.e24 (2019).
◦ TF
◦ RC Brown et al., An equivariant Bayesian convolutional network predicts recombination
hotspots and accurately resolves binding motifs. Bioinformatics. 35, 2177–2184 (2019).
Transcriptomics
◦ G Eraslan et al., Single-cell RNA-seq denoising using a deep count
autoencoder. Nature Communications. 10, 1–14 (2019).
◦ M Lotfollahi et al., scGen predicts single-cell perturbation responses. Nature
Methods. 16, 715–721 (2019).
◦ M Colomé-Tatché et al., Statistical single cell multi-omics integration. Current
Opinion in Systems Biology. 7, 54–59 (2018).
◦ C Lin et al., Using neural networks for reducing the dimensions of single-cell
RNA-Seq data. Nucleic Acids Research. 45 (2017), doi:10.1093/nar/gkx681.
◦ GP Way et al., Bayesian deep learning for single-cell analysis. Nature Methods.
15, 1009–1010 (2018).
◦ R Lopez et al., Deep generative modeling for single-cell transcriptomics.
Nature Methods. 15, 1053–1058 (2018).
◦ J Wang et al., Data denoising with transfer learning in single-cell
transcriptomics. Nature Methods. 16 (2019), doi:10.1038/s41592-019-0537-1.
◦ OmicsMapNet: Transforming omics data to take advantage of Deep
Convolutional Neural Network for discovery.
Drug discovery
◦ R Gómez-Bombarelli et al., Automatic Chemical Design Using a Data-Driven Continuous
Representation of Molecules. ACS Central Science. 4, 268–276 (2018).
◦ CF Lipinski et al., Advances and Perspectives in Applying Deep Learning for Drug Design and
Discovery. 6, 1–6 (2019).
◦ L David et al., Applications of deep-learning in exploiting large-scale and heterogeneous
compound data in industrial pharmaceutical research. Frontiers in Pharmacology. 10, 1–16 (2019).
Imaging
◦ G Lee et al., Predicting Alzheimer’s disease progression using multi-modal
deep learning approach. Scientific Reports. 9, 1–12 (2019).
◦ H Chen et al., VoxResNet: Deep voxelwise residual networks for brain
segmentation from 3D MR images. NeuroImage. 170, 446–455 (2018).
◦ T Jo et al., Deep Learning in Alzheimer’s Disease: Diagnostic Classification
and Prognostic Prediction Using Neuroimaging Data. Frontiers in Aging
Neuroscience. 11 (2019), doi:10.3389/fnagi.2019.00220.
◦ A Iqbal et al., Developing a brain atlas through deep learning. Nature Machine
Intelligence. 1, 277–287 (2019).
◦ A Mahbod et al., Automatic brain segmentation using artificial neural
networks with shape context. Pattern Recognition Letters. 101, 74–79 (2018).
◦ P Kumar et al., U-SEGNET: Fully convolutional neural network based
automated brain tissue segmentation tool. arXiv (2018).



Editor's Notes

  • #3: Predictive modeling. Algorithmic modeling ~ machine learning. In reality, there’s a lot of overlap between these approaches. For example, you can assume a normal distribution within a machine learning model.
  • #18: Pseudotime on DCA bottleneck coordinates was highly correlated with pseudotime on PCA coordinates (recommended usage), suggesting that DCA is capturing the correct continuous feature (cellular differentiation).