Giovanni Bocci, Cristian Bologa, Daniel Byrd, Jayme Holmes, Stephen
Mathias, Oleg Ursu, Anna Waller, Jeremy Yang & Tudor Oprea
03/15/2019
INBRE-NMBIST Symposium
Santa Fe, NM
Funding: NIH U24 CA224370 & NIH U24 TR002278
ILLUMINATING THE DRUGGABLE
GENOME WITH KNOWLEDGE
ENGINEERING AND MACHINE
LEARNING
datascience.unm.edu
pharos.nih.gov/idg/
druggablegenome.net
75% of protein research is still
focused on the 10% of genes known
before the human genome was mapped
AM Edwards et al, Nature, 2011
This prompted NIH to start the
Illuminating the Druggable Genome
Initiative (U54, Common Fund)
HGP
3
"If I have seen further it is by
standing on the shoulders of
Giants." - Isaac Newton, ~1675
Organization of this talk:
1. Shoulders development (knowledge engineering).
2. Seeing further efforts (machine learning).
4
IDG: AN INTERNATIONAL CONSORTIUM
5
http://hscnews.unm.edu/news/shedding-light-on-the-dark-genome
6
pharos.nih.gov/idg/ Omics
Dimensions
ML READY: PHAROS
11/13/18 revision
https://pharos.nih.gov/idg/targets/GRIN2A
7
ML READY: DRUGCENTRAL
10/16/18 revision
http://drugcentral.org/drugcard/1679
8
ML READY: HARMONIZOME
9
https://amp.pharm.mssm.edu/Harmonizome/
COMPONENTS OF IDG
https://druggablegenome.net/
DRGC
RDOC
IT
KMC
RFA-RM-16-026
(DRGC)
GPCR
U24 DK116195:
Bryan Roth, M.D., Ph.D. (UNC)
Brian Shoichet, Ph.D. (UCSF)
Ion
Channel
U24 DK116214:
Lily Jan, Ph.D. (UCSF)
Michael T. McManus, Ph.D. (UCSF)
Kinase
U24 DK116204:
Gary L. Johnson, Ph.D. (UNC)
RFA-RM-16-025
(RDOC)
U24 TR002278:
Stephan C. Schürer, Ph.D.  (UMiami)
Dusica Vidovic, Ph.D.  (UMiami)
Tudor Oprea, M.D., Ph.D.  (UNM)
Larry A. Sklar, Ph.D.  (UNM)
RFA-RM-16-024
(KMC)
U24 CA224260:
Avi Ma’ayan, Ph.D.  (ISMMS)
U24 CA224370:
Tudor Oprea, M.D., Ph.D. (UNM)
RFA-RM-18-011
(CEIT)
Awards starting date March 2019
Further information
Email: idg.rdoc@gmail.com
Follow: @DruggableGenome
URLs:
https://druggablegenome.net/
https://commonfund.nih.gov/idg/
IDG Knowledge User-Interface
Email: pharos@mail.nih.gov
Follow: @IDG_Pharos
URL: https://pharos.nih.gov/
10
TARGET DEVELOPMENT LEVEL (TDL)
▪ Most protein classification schemes are
based on structural and functional criteria.
▪ For therapeutic development, it is useful to
understand how much and what types of
data are available for a given protein,
thereby highlighting well-studied and
understudied targets.
▪ Tclin: Proteins annotated as drug targets
▪ Tchem: Proteins for which potent small
molecules are known
▪ Tbio: Proteins for which biology is better
understood
▪ Tdark: These proteins lack antibodies,
publications or Gene RIFs
3/23/18 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018,
https://www.nature.com/articles/nrd.2018.14
11
TDL LEVELS: Tclin and Tchem
▪ Tclin proteins are associated
with drug Mechanism of Action
(MoA) – NRDD 2017
▪ Tchem proteins have
bioactivities in ChEMBL and
DrugCentral, + human curation
for some targets
▪ Kinases: <= 30nM
▪ GPCRs: <= 100nM
▪ Nuclear Receptors: <= 100nM
▪ Ion Channels: <= 10μM
▪ Non-IDG Family Targets: <= 1μM
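The potency cutoffs listed above can be read as a simple lookup by target family. The following is a minimal sketch in Python, assuming activities are expressed in nanomolar; the function name, the dictionary, and the family labels used as keys are illustrative choices, not identifiers from TCRD or Pharos.

```python
# Potency cutoffs from the slide, converted to nanomolar (1 uM = 1000 nM).
# The key strings are illustrative labels, not official TCRD family codes.
TCHEM_CUTOFF_NM = {
    "Kinase": 30.0,
    "GPCR": 100.0,
    "Nuclear Receptor": 100.0,
    "Ion Channel": 10_000.0,
}
DEFAULT_CUTOFF_NM = 1_000.0  # Non-IDG family targets: <= 1 uM


def meets_tchem_cutoff(target_family: str, activity_nm: float) -> bool:
    """Return True if an activity value (in nM) is at or below the family cutoff."""
    cutoff = TCHEM_CUTOFF_NM.get(target_family, DEFAULT_CUTOFF_NM)
    return activity_nm <= cutoff


print(meets_tchem_cutoff("Kinase", 25.0))         # True
print(meets_tchem_cutoff("Kinase", 50.0))         # False
print(meets_tchem_cutoff("Ion Channel", 5000.0))  # True
print(meets_tchem_cutoff("Protease", 1500.0))     # False
```

Any family not listed falls back to the 1 μM cutoff, mirroring the "Non-IDG Family Targets" line.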
10/19/16 revision
Bioactivities of approved drugs (by Target class)
ChEMBL: database of bioactive chemicals
https://www.ebi.ac.uk/chembl/
DrugCentral: online drug compendium
http://drugcentral.org/
R. Santos et al., Nature Rev. Drug Discov. 2017, https://www.nature.com/articles/nrd.2016.230
12
TDL LEVELS: Tbio and Tdark
▪ Tbio proteins lack small molecule annotation cf. Tchem criteria,
and satisfy one of these criteria:
▪ protein is above the cutoff criteria for Tdark
▪ protein is annotated with a GO Molecular Function or Biological Process
leaf term(s) with an Experimental Evidence code
▪ protein has confirmed OMIM phenotype(s)
▪ Tdark (“ignorome”) proteins have little information available, and satisfy
these criteria:
▪ PubMed text-mining score from Jensen Lab < 5
▪ <= 3 Gene RIFs
▪ <= 50 Antibodies available according to antibodypedia.com
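The four development levels can be sketched as a cascade of checks. The Python below is a hedged illustration, not the TCRD implementation: the `ProteinEvidence` fields, the order of the checks (Tclin, then Tchem, then Tbio, then Tdark), and the `min_dark_criteria` parameter are assumptions. In particular, the slide does not state how many of the three Tdark criteria must hold at once, so the caller has to supply that number explicitly.

```python
from dataclasses import dataclass


@dataclass
class ProteinEvidence:
    """Hypothetical per-protein summary; field names are illustrative."""
    has_moa_drug: bool              # annotated as a drug mechanism-of-action target
    has_potent_compound: bool       # passes the family-specific Tchem cutoff
    pubmed_score: float             # Jensen Lab PubMed text-mining score
    generif_count: int              # number of Gene RIFs
    antibody_count: int             # antibodies listed by antibodypedia.com
    has_go_experimental_leaf: bool  # GO MF/BP leaf term with experimental evidence
    has_omim_phenotype: bool        # confirmed OMIM phenotype


def dark_criteria_met(p: ProteinEvidence) -> int:
    """Count how many of the three low-knowledge criteria hold."""
    return sum([
        p.pubmed_score < 5,
        p.generif_count <= 3,
        p.antibody_count <= 50,
    ])


def assign_tdl(p: ProteinEvidence, *, min_dark_criteria: int) -> str:
    """Assign a TDL label; the check order is an assumption of this sketch."""
    if p.has_moa_drug:
        return "Tclin"
    if p.has_potent_compound:
        return "Tchem"
    above_dark_cutoff = dark_criteria_met(p) < min_dark_criteria
    if above_dark_cutoff or p.has_go_experimental_leaf or p.has_omim_phenotype:
        return "Tbio"
    return "Tdark"


unstudied = ProteinEvidence(False, False, 1.2, 0, 10, False, False)
borderline = ProteinEvidence(False, False, 2.0, 2, 200, False, False)
print(assign_tdl(unstudied, min_dark_criteria=3))   # Tdark
print(assign_tdl(borderline, min_dark_criteria=3))  # Tbio
print(assign_tdl(borderline, min_dark_criteria=2))  # Tdark
```

The `borderline` case shows why the threshold matters: the same protein lands in Tbio or Tdark depending on how many criteria are required.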
13
TDL: EXTERNAL VALIDATION
Tdark parameters differ from the other TDLs across the 4 external
metrics cf. Kruskal-Wallis post-hoc pairwise Dunn tests
2/23/18 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018,
https://www.nature.com/articles/nrd.2018.14
14
WHY FUND TDARK RESEARCH?
2/23/18 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018,
https://www.nature.com/articles/nrd.2018.14
Typically, it takes 15-20 years for a Tdark protein to become druggable
15
IMPC BOLDLY GOES WHERE NO ONE
HAS GONE BEFORE
95% of eligible IDG genes
(339/356) have plans,
attempts, or models
384 genes were prioritized
by IDG KMC (2014-2016)
[Chart data labels: 17, 28, 17, 1, 63, 24, 50, 79]
168306 Tbio genes
90 Tdark genes
42 Tchem genes
11/29/17 revision
Slide from Steve Murray, Jackson Lab
16
TAKE HOME MESSAGE:
THERE IS A
KNOWLEDGE DEFICIT
3/12/18 revision
~35% of the proteins remain
poorly described (Tdark)
~11% of the Proteome (Tclin & Tchem) is currently targeted
by small molecule probes
Choosing to work on dark genes is a high-risk endeavor
(Funders are less likely to award grants for Tdark)
CHALLENGE: RANKING & SCORING
PROTEIN-DISEASE ASSOCIATIONS
https://pharos-beta.ncats.io/targets/GRIN2A
The IDG KMC tracks more than ~10 information
channels for protein-disease associations,
accessible via the Pharos portal.
Our challenge is to harmonize disease
concepts and to enable computational
use: e.g., GRIN2A and GRIN1 together form the
glutamate NMDA receptor, the MoA drug
target for memantine (Alzheimer’s).
The challenge for ML & AI: How to
prioritize targets? i.e., which
protein-disease associations are clinically
actionable?
10/07/18 revision
18
WHAT DO WE KNOW ABOUT
DISEASES?
▪ There are between 9,000 and 25,000 disease concepts
▪ Pharos/TCRD tracks ~11,000 diseases via Disease
Ontology, and ~10,500 rare diseases via eRAM,
OrphaNet and the Monarch Initiative MONDO system
19
PROTEIN KNOWLEDGE GRAPHS
▪ IDG KMC2 seeks knowledge gaps
across the five branches of the
“knowledge tree”:
▪ Genotype; Phenotype; Interactions
& Pathways; Structure & Function;
and Expression.
▪ We can use biological systems
network modeling to infer novel
relationships based on available
evidence, and infer new “function”
and “role in disease” data based
on other layers of evidence
▪ Primary focus on Tdark & Tbio
O. Ursu, T Oprea et al., IDG2 KMC 2/01/18 revision
20
THE METAPATH-ML APPROACH
▪ A metapath is a sequence of
relations defined between
different object types.
▪ Our metapaths encode
type-specific network topology
between the source node (Protein)
and the destination node
(Disease/Phenotype).
▪ This approach enables the
transformation of
assertions/evidence chains of
heterogeneous biological data
types into an ML-ready format.
SOME REFS: G. Fu et al., BMC Bioinformatics 2016. D Himmelstein & S Baranzini, PLOS Comp Bio, 2015.
Similar assertions or evidence form metapaths (white).
Instances of a metapath (paths) are used to determine the strength of the evidence linking a
gene to a disease/phenotype/function.
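To make the distinction between a metapath (a type-level template) and its instances (concrete paths) tangible, here is a small Python sketch. Everything in it is a toy assumption: the node names, the relation names, the `~` prefix for traversing a relation backwards, and the use of a raw instance count as the evidence signal. The slides do not specify how MetapathML scores path instances.

```python
from collections import defaultdict

# Toy typed graph. Every node, relation and edge here is made up for illustration.
NODE_TYPE = {
    "P1": "Protein", "P2": "Protein", "P3": "Protein",
    "PW1": "Pathway",
    "D1": "Disease",
}
EDGES = [
    ("P1", "member_of", "PW1"),
    ("P2", "member_of", "PW1"),
    ("P3", "member_of", "PW1"),
    ("P2", "associated_with", "D1"),
    ("P3", "associated_with", "D1"),
]

# Index edges in both directions so a metapath can traverse a relation backwards.
ADJ = defaultdict(list)
for src, rel, dst in EDGES:
    ADJ[(src, rel)].append(dst)
    ADJ[(dst, "~" + rel)].append(src)  # "~" marks the inverse direction


def path_instances(source, metapath):
    """Enumerate all simple paths from `source` that follow `metapath`.

    `metapath` is a list of (relation, expected node type) steps.
    """
    paths = [[source]]
    for rel, node_type in metapath:
        extended = []
        for path in paths:
            for nxt in ADJ.get((path[-1], rel), []):
                # Enforce the node type and forbid revisiting a node.
                if NODE_TYPE[nxt] == node_type and nxt not in path:
                    extended.append(path + [nxt])
        paths = extended
    return paths


# Protein -member_of-> Pathway <-member_of- Protein -associated_with-> Disease
METAPATH = [
    ("member_of", "Pathway"),
    ("~member_of", "Protein"),
    ("associated_with", "Disease"),
]

for path in path_instances("P1", METAPATH):
    print(" -> ".join(path))
```

In this toy graph, P1 reaches D1 through two instances of the same metapath (via P2 and via P3), so the count for that metapath would be 2.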
21
22
SOME EARLY ACKNOWLEDGMENTS ...
Abstract: We hear a lot about machine learning and its role in health
care, but these methods require large amounts of training data. Using
these and other related method to study rare diseases poses
substantial challenges: how can we get tens of thousands of training
examples when there are tens or hundreds of people with a disease?
(Abstract from this
conference)
Scarce training data our
problem too. But with
genes instead of people.
Our MetapathML method
similar to
Himmelstein-Baranzini.
(Daniel is now post-doc in
Greene Lab.)
METAPATH-ML DATA SOURCES
O. Ursu et al., manuscript in preparation
Data source | Data type | Data points
CCLE | Gene expression | 19,006,134
GTEx | Gene expression | 2,612,227
Protein Atlas | Gene & Protein expression | 949,199
Reactome | Biological pathways | 303,681
KEGG | Biological pathways | 27,683
StringDB | Protein-Protein interactions | 5,080,023
Gene ontology | Biological pathways & Gene function | 434,317
InterPro | Protein structure and function | 467,163
ClinVar | Human Gene - Disease/Phenotype associations | 881,357
GWAS | Gene - Disease/Phenotype associations | 54,360
OMIM | Human Gene - Disease/Phenotype associations | 25,557
UniProt Disease | Human Gene - Disease/Phenotype associations | 5,365
JensenLab DISEASE | Gene - Disease associations from text mining | 44,829
NCBI Homology | Homology mapping of human/mouse/rat genes | 70,922
IMPC | Mouse Gene - Phenotype associations | 2,153,999
RGD | Rat Gene - Phenotype associations | 117,606
LINCS | Drug induced gene signatures | 230,111,315
We developed automated
methods for data collection
(TCRD), visualization (Pharos)
and data aggregation.
 
These aggregated datasets
were used to build machine
learning models for 20+
diseases and 73 mouse
phenotypes.
Each knowledge graph
contains ~22,000 metapaths
and 284 million path instances.
10/07/18 revision
23
METAPATH-ML WORKFLOW
▪ A meta-path encodes type-specific network topology between the source node
(e.g., Protein target) and the destination node (e.g., Disease or Function)
▪ Target –– (member of) → PPI Network ← (member of) –– Protein –– (associated
with) → Disease
▪ Target –– (expressed in) → Tissue ← (localized in) –– Disease
O. Ursu, T Oprea et al., IDG2 KMC 2/01/18 revision
24
METAPATH-ML @ UNM
one protein-disease
association at a time
O. Ursu, T Oprea et al., IDG2 KMC 2/01/18 revision
Genes associated with a disease/phenotype are positive examples, whereas genes lacking the same
association are negative examples. The Metapath approach transforms assertions/evidence chains into
classification problems that can be solved using suitably designed machine learning algorithms.
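As a rough illustration of how evidence chains become a classification problem, the Python sketch below turns per-gene metapath counts into a feature matrix with binary labels. The gene names, metapath labels, counts, and the `build_dataset` helper are all hypothetical; the actual MetapathML feature encoding is not described in these slides.

```python
def build_dataset(path_counts, known_disease_genes, metapaths):
    """Turn per-gene metapath counts into (gene ids, feature rows, labels).

    Genes known to be associated with the disease get label 1, all others 0.
    """
    genes = sorted(path_counts)
    features = [[path_counts[g].get(m, 0) for m in metapaths] for g in genes]
    labels = [1 if g in known_disease_genes else 0 for g in genes]
    return genes, features, labels


# Hypothetical metapath labels and counts for one disease.
METAPATHS = [
    "Protein-PPI-Protein-Disease",
    "Protein-Pathway-Protein-Disease",
    "Protein-Tissue-Disease",
]
PATH_COUNTS = {
    "GENE_A": {"Protein-PPI-Protein-Disease": 4, "Protein-Tissue-Disease": 1},
    "GENE_B": {"Protein-Pathway-Protein-Disease": 2},
    "GENE_C": {},
}
KNOWN_DISEASE_GENES = {"GENE_A"}

genes, X, y = build_dataset(PATH_COUNTS, KNOWN_DISEASE_GENES, METAPATHS)
for gene, row, label in zip(genes, X, y):
    print(gene, row, label)
```

Each row is one gene, each column one metapath, which is the tabular shape a tree-based classifier expects.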
25
Use of XGBoost
(XGBoost = eXtreme Gradient Boosting)
● https://xgboost.ai/
● GitHub
● Documentation
● R package
● Exceptional interpretability
MetapathML employs XGBoost via the R package API. The inputs to XGBoost
are datasets specific to each disease or phenotype. For each disease/phenotype
some known associated genes correspond with the positive Y labels in the
dataset. XGBoost parameters are optimized via grid search, i.e. iterative testing
over discrete parameter value combinations.
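Grid search as described here amounts to scoring every combination of a few discrete parameter values and keeping the best one. The sketch below is library-agnostic Python rather than the R API the slide mentions. The parameter names `max_depth`, `eta` and `subsample` are real XGBoost parameters, but the candidate values and the `toy_score` function are placeholders; in practice the score would be a cross-validated metric from a trained model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in `param_grid`; return (best params, best score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are illustrative, not the grid used by the authors.
PARAM_GRID = {
    "max_depth": [3, 5, 7],
    "eta": [0.05, 0.1, 0.3],
    "subsample": [0.7, 1.0],
}


def toy_score(params):
    """Placeholder for a cross-validated metric (higher is better)."""
    return (
        -abs(params["max_depth"] - 5)
        - abs(params["eta"] - 0.1)
        + 0.1 * params["subsample"]
    )


best_params, best_score = grid_search(PARAM_GRID, toy_score)
print(best_params)
```

With 3 x 3 x 2 candidate values the loop evaluates 18 combinations; the cost grows multiplicatively with every parameter added to the grid.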
27
ALZHEIMER’S DISEASE (AD) METAPATH
ML MODEL
▪Build data matrix from “Alzheimer’s disease” in
TCRD subset
▪ protein knowledge graph along metapaths:
▪ Protein – Protein Interactions
▪ Pathways
▪ GO terms
▪ Gene expression
▪ Etc.
▪ Training set: 53 genes associated with
Alzheimer’s disease (positives); 3,952 genes
associated with other pathologies from OMIM
were assumed to be negative
▪ Test set: 23 genes associated with Alzheimer's
(positives) and 200 genes not associated with
Alzheimer's (negatives) ← from Text Mining
▪ “Complete forest” binary classifier using
XGBoost & 5-fold cross-validation.
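With 53 positives against 3,952 negatives, how the five folds are drawn matters: a purely random split could leave a fold with very few positives. The Python sketch below shows one common option, a stratified split that spreads each class evenly across folds. Whether the authors stratified their folds is not stated on the slide, so treat this as an assumption; `stratified_folds` is a hypothetical helper, not part of XGBoost.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Class sizes taken from the training set described above.
labels = [1] * 53 + [0] * 3952
folds = stratified_folds(labels, k=5)

for fold_id, fold in enumerate(folds):
    n_pos = sum(labels[i] for i in fold)
    print(f"fold {fold_id}: {len(fold)} genes, {n_pos} positives")
```

Each fold ends up with 10 or 11 positives, so every validation round sees a comparable number of known Alzheimer's genes.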
2/14/18 revision
ML work by Oleg Ursu
Confusion matrix (test set):
              Predicted Pos   Predicted Neg
Actual Pos         20               3
Actual Neg         41             159
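The confusion matrix above can be summarized with a few standard metrics. The short Python calculation below reads the rows as actual classes and the columns as predicted classes, which matches the stated test set (23 positives, 200 negatives); the metric names are standard, but the slides themselves do not report these derived values.

```python
# Counts read from the confusion matrix (rows = actual, columns = predicted).
tp, fn = 20, 3     # actual positives: predicted positive / predicted negative
fp, tn = 41, 159   # actual negatives: predicted positive / predicted negative

sensitivity = tp / (tp + fn)                 # recall on known AD genes
specificity = tn / (tn + fp)                 # fraction of negatives rejected
precision = tp / (tp + fp)                   # fraction of predictions that are known AD genes
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"sensitivity = {sensitivity:.3f}")  # 0.870
print(f"specificity = {specificity:.3f}")  # 0.795
print(f"precision   = {precision:.3f}")    # 0.328
print(f"accuracy    = {accuracy:.3f}")     # 0.803
```

The low precision is expected in this setting: the 41 "false positives" are genes with no known Alzheimer's association, which is exactly the pool from which new candidates are drawn.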
29
AD XGBOOST CLASSIFIER:
VARIABLE IMPORTANCE PLOT
▪ The most important features are interactions with
proteins mediating inflammatory processes
(JAK2/Tclin, IL10 & IL2 / Tchem), response to oxidative
stress (GSTP1/Tchem), nervous system development
(BDNF/Tbio) and glycolysis (GAPDH/Tchem).
▪ LINCS drug-induced gene expression perturbations are
the largest category of features for these predictions.
▪ Brain cortex expression is a necessary requirement.
▪ One Reactome pathway (AU-rich mRNA elements binding
proteins) is also important.
▪ The weighted approach showed better performance in the
test set for Alzheimer's Disease, Schizophrenia, and Dilated
Cardiomyopathy.
4/23/18 revision
ML work by Oleg Ursu
30
EXPERIMENTAL VALIDATION: AD
▪ SHSY5Ys pTau siRNA test
▪ Measured pTau levels after knock-down of gene expression
• Human iPSNs qPCR
▪ Measuring endogenous gene expression levels, AD vs Ctrl
▪ Western blot or ICC to characterize AD phenotype versus control
• Human Tissue qPCR
▪ Measuring endogenous gene expression levels, AD vs Ctrl
▪ Western blot or ICC to characterize AD phenotype versus control
11/14/18 revision
AD validation work by Jessica Binder & Kiran Bhaskar (UNM), funded by U24CA224370-S2 supplement
31
2/14/19 revision
AD validation work by Jessica Binder & Kiran Bhaskar (UNM), funded by U24CA224370-S2 supplement
▪Validation on the 20 predicted genes: AKNA, BCO2, CCNY,
CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL, LILRA3, LMO4,
NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP, STARD3,
TMEFF2, TXNDC12
▪The most obvious effects based on the combined Cellomics &
qPCR of iPSNs & autopsy brains suggest that AKNA, LILRA3,
NDRG2 and TXNDC12 significantly increased pTau (as tracked
by two different antibodies for T180, S202 and S205)
▪For now, it appears that machine learning models may have
identified between 4 and 7 new genes that have previously not
been associated with Alzheimer’s Disease
32
EXPERIMENTAL VALIDATION: AD
33
EXPERIMENTAL VALIDATION: MORE
DISEASES AND COLLABORATORS
Disease | Experimental collaboration
Prostate cancer | Work by Art Cherkasov, Kriti Singh & Mike Hsing (UBC, Vancouver). Of the top 50 ML-predicted genes, 19 are commonly upregulated in the YZ Wang Transdifferentiation PDX model and the Beltran 2016 dataset.
Ovarian cancer | Spheroid tumor & patient-derived xenograft (PDX) work by Mara Steinkamp (UNM). Of the top 63 ML-predicted genes, 12 show significant changes in cancer cells.
NEXT STEPS:
● In vivo experiments.
● More diseases and phenotypes.
ML LEARNINGS IN TARGET AND
DRUG DISCOVERY
1. Model quality is limited by data quality. Good data → good models.
2. ML can identify hidden patterns in big data. For example, the
central node(s) in PPI network(s) that are playing a critical role in
disease pathology.
3. Deep learning not so applicable to our task (better for tall datasets,
well defined good solutions, less need for interpretability).
4. XGBoost (decision tree algorithm) excels in performance &
interpretability.
5. Shows real promise in Target Repurposing.
34
35
IN CLOSING...
● IDG platform for knowledge
discovery about the "dark genome."
● ML provides new insights by
integrating multi-omics knowledge
graphs.
● Hard questions should be directed to
Tudor Oprea!

Illuminating the Druggable Genome with Knowledge Engineering and Machine Learning

  • 1. Giovanni Bocci, Cristian Bologa, Daniel Byrd, Jayme Holmes, Stephen Mathias, Oleg Ursu, Anna Waller, Jeremy Yang & Tudor Oprea 03/15/2019 INBRE-NMBIST Symposium Santa Fe, NM Funding: NIH U24 CA224370 & NIH U24 TR002278 ILLUMINATING THE DRUGGABLE GENOME WITH KNOWLEDGE ENGINEERING AND MACHINE LEARNING datascience.unm.edupharos.nih.gov/idg/ druggablegenome.net
  • 2. 75% of protein research still focused on 10% genes known before human genome was mapped AM Edwards et al, Nature, 2011 This prompted NIH to start the Illuminating the Druggable Genome Initiative (U54, Common Fund) HGP
  • 3. 3 "If I have seen further it is by standing on the shoulders of Giants." - Isaac Newton, ~1675 Organization of this talk: 1. Shoulders development (knowledge engineering). 2. Seeing further efforts (machine learning).
  • 7. ML READY: PHAROS 11/13/18 revisionhttps://pharos.nih.gov/idg/targets/GRIN2A 7
  • 8. ML READY: DRUGCENTRAL 10/16/18 revision http://drugcentral.org/drugcard/1679 8
  • 10. COMPONENTS OF IDG https://druggablegenome.net/ DRGC RDOC IT KMC RFA-RM-16-026 (DRGC) GPCR U24 DK116195: Bryan Roth, M.D., Ph.D. (UNC) Brian Shoichet, Ph.D. (UCSF) Ion Channel U24 DK116214: Lily Jan, Ph.D. (UCSF) Michael T. McManus, Ph.D. (UCSF) Kinase U24 DK116204: Gary L. Johnson, Ph.D. (UNC) RFA-RM-16-025 (RDOC) U24 TR002278: Stephan C. Schürer, Ph.D.  (UMiami) Dusica Vidovic, Ph.D.  (UMiami) Tudor Oprea, M.D., Ph.D.  (UNM) Larry A. Sklar, Ph.D.  (UNM) RFA-RM-16-024 (KMC) U24 CA224260: Avi Ma’ayan, Ph.D.  (ISMMS) U24 CA224370: Tudor Oprea, M.D., Ph.D. (UNM) RFA-RM-18-011 (CEIT) Awards starting date March 2019 Further information Email: idg.rdoc@gmail.com Follow: @DruggableGenome URLs: https://druggablegenome.net / https://commonfund.nih.gov/i dg/ IDG Knowledge User-Interface Email: pharos@mail.nih.gov Follow: @IDG_Pharos URL: https://pharos.nih.gov/ 10
  • 11. TARGET DEVELOPMENT LEVEL (TDL) ▪ Most protein classification schemes are based on structural and functional criteria. ▪ For therapeutic development, it is useful to understand how much and what types of data are available for a given protein, thereby highlighting well-studied and understudied targets. ▪ Tclin: Proteins annotated as drug targets ▪ Tchem: Proteins for which potent small molecules are known ▪ Tbio: Proteins for which biology is better understood ▪ Tdark: These proteins lack antibodies, publications or Gene RIFs 3/23/18 revision T. Oprea et al., Nature Rev. Drug Discov. 2018, https://www.nature.com/articles/nrd.2018.14 11
  • 12. TDL LEVELS: Tclin and Tchem ▪ Tclin proteins are associated with drug Mechanism of Action (MoA) – NRDD 2017 ▪ Tchem proteins have bioactivitis in ChEMBL and DrugCentral, + human curation for some targets ▪ Kinases: <= 30nM ▪ GPCRs: <= 100nM ▪ Nuclear Receptors: <= 100nM ▪ Ion Channels: <= 10μM ▪ Non-IDG Family Targets: <= 1μM 10/19/16 revision Bioactivities of approved drugs (by Target class) ChEMBL: database of bioactive chemicals https://www.ebi.ac.uk/chembl/ DrugCentral: online drug compendium http://drugcentral.org/ R. Santos et al., Nature Rev. Drug Discov. 2017, https://www.nature.com/articles/nrd.2016.230 12
  • 13. TDL LEVELS Tbio and Tdark ▪ Tbio proteins lack small molecule annotation cf. Tchem criteria, and satisfy one of these criteria: ▪ protein is above the cutoff criteria for Tdark ▪ protein is annotated with a GO Molecular Function or Biological Process leaf term(s) with an Experimental Evidence code ▪ protein has confirmed OMIM phenotype(s) ▪ Tdark (“ignorome”) have little information available, and satisfy these criteria: ▪ PubMed text-mining score from Jensen Lab < 5 ▪ <= 3 Gene RIFs ▪ <= 50 Antibodies available according to antibodypedia.com 13
  • 14. TDL: EXTERNAL VALIDATION Tdark parameters differ from the other TDLs across the 4 external metrics cf. Kruskal-Wallis post-hoc pairwise Dunn tests 2/23/18 revision T. Oprea et al., Nature Rev. Drug Discov. 2018, https://www.nature.com/articles/nrd.2018.14 14
  • 15. WHY FUND TDARK RESEARCH? 2/23/18 revision T. Oprea et al., Nature Rev. Drug Discov. 2018, https://www.nature.com/articles/nrd.2018.14 Typically, it takes 15-20 years for a Tdark protein to become druggable 15
  • 16. IMPC BOLDLY GOES WHERE NO ONE HAS GONE BEFORE 95% of eligible IDG genes (339/356) have plans, attempts, or models 384 genes were prioritized by IDG KMC (2014-2016) 17 28 17 1 63 24 50 79 168306 Tbio genes 90 Tdark genes 42 Tchem genes 11/29/17 revision Slide from Steve Murray, Jackson Lab 16
  • 17. TAKE HOME MESSAGE: THERE IS A KNOWLEDGE DEFICIT 3/12/18 revision ~35% of the proteins remain poorly described (Tdark) ~11% of the Proteome (Tclin & Tchem) are currently targeted by small molecule probes Choosing to work on dark genes is a high-risk endeavor (Funders are less likely to award grants for Tdark)
  • 18. CHALLENGE: RANKING & SCORING PROTEIN-DISEASE ASSOCIATIONS https://pharos-beta.ncats.io/targets/GRIN2A The IDG KMC tracks more ~10 information channels for protein-disease associations, accessible via the Pharos portal. Our challenge is to harmonize disease concepts, and to enable computational use: e.g., GRIN2A with GRIN1 form the Glutamate NMDA receptor, MoA drug target for memantine (Alzheimer’s). The challenge for ML & AI: How to prioritize targets? i.e., which protein-disease associations are clinically actionable? 10/07/18 revision 18
  • 19. WHAT DO WE KNOW ABOUT DISEASES? ▪ There are between 9,000 and 25,000 disease concepts ▪ Pharos/TCRD tracks ~11,000 disease via Disease Ontology, and ~10500 rare disease via eRAM, OrphaNet and the Monarch Initiative MONDO system 19
  • 20. PROTEIN KNOWLEDGE GRAPHS ▪ IDG KMC2 seeks knowledge gaps across the five branches of the “knowledge tree”: ▪ Genotype; Phenotype; Interactions & Pathways; Structure & Function; and Expression, respectively. ▪ We can use biological systems network modeling to infer novel relationships based on available evidence, and infer new “function” and “role in disease” data based on other layers of evidence ▪ Primary focus on Tdark & Tbio O. Ursu, T Oprea et al., IDG2 KMC 2/01/18 revision 20
  • 21. THE METAPATH-ML APPROACH▪ A metapath is a sequence of relations defined between different object types. ▪ Our metapaths encode type-specific network topology between the source node (Protein) and the destination node (Disease/Phenotype). ▪ This approach enables the transformation of assertions/evidence chains of heterogeneous biological data types into a ML ready format. SOME REFS: G. Fu et al., BMC Bioinformatics 2016. D Himmelstein & S Baranzini, PLOS Comp Bio, 2015. Similar assertions or evidence form metapaths (white). Instances of metapath (paths) are used to determine the strength of the evidence linking a gene to disease/phenotype/function. 21
  • 22. 22 SOME EARLY ACKNOWLEDGMENTS ... Abstract: We hear a lot about machine learning and its role in health care, but these methods require large amounts of training data. Using these and other related method to study rare diseases poses substantial challenges: how can we get tens of thousands of training examples when there are tens or hundreds of people with a disease? (Abstract from this conference) Scarce training data our problem too. But with genes instead of people. Our MetapathML method similar to Himmelstein-Baranzini. (Daniel is now post-doc in Greene Lab.)
  • 23. METAPATH-ML DATA SOURCES O. Ursu et al., manuscript in preparation Data source Data type Data points CCLE Gene expression 19,006,134 GTEx Gene expression 2,612,227 Protein Atlas Gene & Protein expression 949,199 Reactome Biological pathways 303,681 KEGG Biological pathways 27,683 StringDB Protein-Protein interactions 5,080,023 Gene ontology Biological pathways & Gene function 434,317 InterPro Protein structure and function 467,163 ClinVar Human Gene - Disease/Phenotype associations 881,357 GWAS Gene - Disease/Phenotype associations 54,360 OMIM Human Gene - Disease/Phenotype associations 25,557 UniProt Disease Human Gene - Disease/Phenotype associations 5,365 JensenLab DISEASE Gene - Disease associations from text mining 44,829 NCBI Homology Homology mapping of human/mouse/rat genes 70,922 IMPC Mouse Gene - Phenotype associations 2,153,999 RGD Rat Gene - Phenotype associations 117,606 LINCS Drug induced gene signatures 230,111,315 We developed automated methods for data collection (TCRD), visualization (Pharos) and data aggregation.   These aggregated datasets were used to build machine learning models for 20+ disease and 73 mouse phenotype. Each knowledge graph contains ~22,000 metapaths and 284 million path instances. 10/07/18 revision 23
  • 24. METAPATH-ML WORKFLOW ▪ A metapath encodes type-specific network topology between the source node (e.g., Protein target) and the destination node (e.g., Disease or Function) ▪ Target –– (member of) → PPI Network ← (member of) –– Protein –– (associated with) → Disease ▪ Target –– (expressed in) → Tissue ← (localized in) –– Disease O. Ursu, T. Oprea et al., IDG2 KMC 2/01/18 revision
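The first metapath on this slide (Target, PPI Network, Protein, Disease) can be illustrated with a small sketch that enumerates path instances over a toy typed graph. This is only an illustration under stated assumptions: the node names, the edge list, and the helper functions `build_index`, `path_instances` and `count_by_destination` are hypothetical and are not taken from the MetapathML code.

```python
from collections import Counter, defaultdict

# Toy typed edges (source, relation, destination). All names are hypothetical.
EDGES = [
    ("T1", "member_of", "PPI_A"),
    ("P1", "member_of", "PPI_A"),
    ("P2", "member_of", "PPI_A"),
    ("T2", "member_of", "PPI_B"),
    ("P3", "member_of", "PPI_B"),
    ("P1", "associated_with", "DiseaseX"),
    ("P2", "associated_with", "DiseaseX"),
    ("P3", "associated_with", "DiseaseY"),
]

# Metapath: Target -(member of)-> PPI Network <-(member of)- Protein
#           -(associated with)-> Disease, written as (relation, direction) steps.
METAPATH = [("member_of", "fwd"), ("member_of", "rev"), ("associated_with", "fwd")]


def build_index(edges):
    """Index edges so each step can be followed forwards or backwards."""
    forward, backward = defaultdict(set), defaultdict(set)
    for src, rel, dst in edges:
        forward[(src, rel)].add(dst)
        backward[(dst, rel)].add(src)
    return forward, backward


def path_instances(start, metapath, forward, backward):
    """Enumerate all paths from `start` that follow the metapath step by step."""
    paths = [[start]]
    for rel, direction in metapath:
        index = forward if direction == "fwd" else backward
        extended = []
        for path in paths:
            for node in sorted(index.get((path[-1], rel), ())):
                if node not in path:  # do not walk back onto a visited node
                    extended.append(path + [node])
        paths = extended
    return paths


def count_by_destination(start, metapath, edges):
    """Count path instances per destination node (a simple evidence score)."""
    forward, backward = build_index(edges)
    return dict(Counter(p[-1] for p in path_instances(start, metapath, forward, backward)))


if __name__ == "__main__":
    print(count_by_destination("T1", METAPATH, EDGES))
    print(count_by_destination("T2", METAPATH, EDGES))
```

In this toy graph the count of path instances reaching a disease acts as one numeric feature for the (target, disease) pair; a real feature set would hold one such column per metapath.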
  • 25. METAPATH-ML @ UNM: one protein-disease association at a time. O. Ursu, T. Oprea et al., IDG2 KMC 2/01/18 revision. Genes associated with a disease/phenotype are positive examples, whereas genes lacking the same association are negative examples. The Metapath approach transforms assertions/evidence chains into classification problems that can be solved using suitably designed machine learning algorithms.
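The labeling rule on this slide can be sketched as a small dataset builder: each gene becomes one row of metapath-derived features, labeled 1 if it is known to be associated with the disease and 0 otherwise. The gene names, feature names, and the `build_dataset` helper below are hypothetical placeholders, not the authors' pipeline.

```python
# Hypothetical per-gene features: path-instance counts along two toy metapaths.
FEATURES = {
    "GENE_A": {"ppi_to_disease": 3, "tissue_to_disease": 1},
    "GENE_B": {"ppi_to_disease": 0, "tissue_to_disease": 2},
    "GENE_C": {"ppi_to_disease": 1},
    "GENE_D": {"tissue_to_disease": 4},
}

# Hypothetical set of genes with a known association to the disease of interest.
KNOWN_DISEASE_GENES = {"GENE_A", "GENE_D"}


def build_dataset(features, known_genes):
    """Turn sparse per-gene features into a dense matrix X and binary labels y."""
    genes = sorted(features)
    columns = sorted({name for row in features.values() for name in row})
    # Missing features are treated as zero path instances.
    X = [[features[gene].get(name, 0) for name in columns] for gene in genes]
    # Positive label for known associations, negative label otherwise.
    y = [1 if gene in known_genes else 0 for gene in genes]
    return genes, columns, X, y


if __name__ == "__main__":
    genes, columns, X, y = build_dataset(FEATURES, KNOWN_DISEASE_GENES)
    for gene, row, label in zip(genes, X, y):
        print(gene, dict(zip(columns, row)), "label =", label)
```

Note that a negative label here only means "no known association", so some negatives may be undiscovered positives; that is exactly the kind of gene such a classifier is meant to surface.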
  • 26. Use of XGBoost (XGBoost = eXtreme Gradient Boosting) ● https://xgboost.ai/ ● GitHub ● Documentation ● R package ● Exceptional interpretability. MetapathML employs XGBoost via the R package API. The inputs to XGBoost are datasets specific to each disease or phenotype. For each disease/phenotype, the known associated genes correspond to the positive Y labels in the dataset. XGBoost parameters are optimized via grid search, i.e. iterative testing over discrete parameter value combinations.
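Grid search as described on this slide can be sketched in a few lines: every combination of discrete parameter values is scored, and the best-scoring combination is kept. The sketch below is in Python rather than R and deliberately does not call XGBoost; `toy_score` is a made-up stand-in for a cross-validated model score, and the grid values are illustrative assumptions (only the parameter names `max_depth` and `eta` mirror real XGBoost parameters).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination of parameter values and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a cross-validated metric (higher is better)."""
    return -(abs(params["max_depth"] - 4) + 10 * abs(params["eta"] - 0.1))


# Illustrative grid; the values are assumptions, not the settings used in MetapathML.
PARAM_GRID = {"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}

if __name__ == "__main__":
    best_params, best_score = grid_search(PARAM_GRID, toy_score)
    print(best_params, best_score)
```

In practice `score_fn` would train a model with the given parameters and return a held-out or cross-validated metric, which makes the number of grid points the main cost driver.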
  • 29. ALZHEIMER’S DISEASE (AD) METAPATH ML MODEL ▪ Build data matrix from “Alzheimer’s disease” in TCRD subset ▪ protein knowledge graph along metapaths: ▪ Protein – Protein Interactions ▪ Pathways ▪ GO terms ▪ Gene expression ▪ Etc. ▪ Training set: 53 genes associated with Alzheimer’s disease (positives); 3,952 genes associated with other pathologies from OMIM were assumed to be negative ▪ Test set: 23 genes associated with Alzheimer's (positives) and 200 genes not associated with Alzheimer's (negatives) ← from Text Mining ▪ “Complete forest” binary classifier using XGBoost & 5-fold cross-validation. 2/14/18 revision. ML work by Oleg Ursu. Test-set confusion matrix (rows = actual, columns = predicted): actual Pos: 20 predicted Pos, 3 predicted Neg; actual Neg: 41 predicted Pos, 159 predicted Neg.
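The confusion matrix on this slide can be summarized with standard metrics. The sketch below reads the rows as actual classes and the columns as predicted classes, an interpretation supported by the row sums (20 + 3 = 23 positives, 41 + 159 = 200 negatives) matching the stated test set. The derived metrics are plain arithmetic on those four counts and are not figures reported on the slide; the function name `summarize` is hypothetical.

```python
def summarize(tp, fn, fp, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall: share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives rejected
        "precision": tp / (tp + fp),    # share of predicted positives that are real
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Counts from the slide, assuming rows = actual and columns = predicted.
AD_TEST = {"tp": 20, "fn": 3, "fp": 41, "tn": 159}

if __name__ == "__main__":
    for name, value in summarize(**AD_TEST).items():
        print(f"{name}: {value:.3f}")
```

Under this reading the model recovers 20 of the 23 known AD genes in the test set, while 41 of the 200 presumed negatives are predicted positive; since negatives are only presumed, some of those 41 may be candidate genes instead of errors.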
  • 30. AD XGBOOST CLASSIFIER: VARIABLE IMPORTANCE PLOT ▪ The most important features are interactions with proteins mediating inflammatory processes (JAK2/Tclin, IL10 & IL2 / Tchem), response to oxidative stress (GSTP1/Tchem), nervous system development (BDNF/Tbio) and glycolysis (GAPDH/Tchem). ▪ LINCS drug-induced gene expression perturbations are the largest category of features for these predictions. ▪ Brain cortex expression is a requirement. ▪ One Reactome pathway (AU-rich mRNA elements binding proteins) is also important. ▪ The weighted approach showed better performance on the test set for Alzheimer's Disease, Schizophrenia, and Dilated Cardiomyopathy. 4/23/18 revision. ML work by Oleg Ursu.
  • 31. EXPERIMENTAL VALIDATION: AD • SHSY5Ys pTau siRNA test ▪ Measured pTau levels after knock-down of gene expression • Human iPSNs qPCR ▪ Measuring endogenous gene expression levels, AD vs Ctrl ▪ Western blot or ICC to characterize AD phenotype versus control • Human Tissue qPCR ▪ Measuring endogenous gene expression levels, AD vs Ctrl ▪ Western blot or ICC to characterize AD phenotype versus control. 11/14/18 revision. AD validation work by Jessica Binder & Kiran Bhaskar (UNM), funded by U24CA224370-S2 supplement.
  • 32. EXPERIMENTAL VALIDATION: AD ▪ Validation on the 20 predicted genes: AKNA, BCO2, CCNY, CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL, LILRA3, LMO4, NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP, STARD3, TMEFF2, TXNDC12 ▪ The most obvious effects, based on the combined Cellomics & qPCR of iPSNs & autopsy brains, suggest that AKNA, LILRA3, NDRG2 and TXNDC12 significantly increased pTau (as tracked by two different antibodies for T180, S202 and S205) ▪ For now, it appears that machine learning models may have identified between 4 and 7 new genes that have not previously been associated with Alzheimer’s Disease. 2/14/19 revision. AD validation work by Jessica Binder & Kiran Bhaskar (UNM), funded by U24CA224370-S2 supplement.
  • 33. EXPERIMENTAL VALIDATION: MORE DISEASES AND COLLABORATORS (Disease: Experimental Collaboration) ▪ Prostate cancer: Work by Art Cherkasov, Kriti Singh & Mike Hsing (UBC, Vancouver). Of the top 50 ML-predicted genes, 19 are commonly upregulated in the YZ Wang Transdifferentiation PDX model and the Beltran 2016 dataset. ▪ Ovarian cancer: Spheroid tumor & patient-derived xenograft (PDX) work by Mara Steinkamp (UNM). Of the top 63 ML-predicted genes, 12 show significant changes in cancer cells. NEXT STEPS: ● In vivo experiments. ● More diseases and phenotypes.
  • 34. ML LEARNINGS IN TARGET AND DRUG DISCOVERY 1. Model quality is limited by data quality. Good data → good models. 2. ML can identify hidden patterns in big data, for example the central node(s) in PPI network(s) that play a critical role in disease pathology. 3. Deep learning is less applicable to our task (it is better suited to tall datasets and well-defined good solutions, with less need for interpretability). 4. XGBoost (a gradient-boosted decision tree algorithm) excels in performance & interpretability. 5. ML shows real promise in target repurposing.
  • 35. IN CLOSING... ● IDG platform for knowledge discovery about the "dark genome." ● ML provides new insights by integrating multi-omics knowledge graphs. ● Hard questions should be directed to Tudor Oprea!