Bigger Data to Increase Drug Discovery
Sean Ekins
Phoenix Nest, Inc., Brooklyn, NY.
Collaborations in Chemistry, Inc., Fuquay Varina, NC.
Collaborative Drug Discovery, Inc., Burlingame, CA.
Collaborations Pharmaceuticals, Inc., Fuquay Varina, NC.
In a Perfect World…
• All major diseases cured
• All > 7000 rare diseases have treatments available
• Neglected diseases are eradicated
• Antibiotics, antivirals, vaccines developed to anticipate all
future mutations
• Drug resistance eradicated
• All research coordinated globally
• Government/individual collaboration discovers and funds all
research
• Billions of molecules will be available with data for different
targets
• All decisions will involve machine learning
• Life expectancy is infinite
Big DATA
Ebola-related tweets in a 6-week
period, 2014
Robert Moore
Why ‘Bigger’ and not ‘Big’
Just a matter of scale?
Drug Discovery’s
definition of Big data
Everyone else’s definition of Big data
What about Chemistry and Biology -
Pharmacology X.0
• Data Sources
• PubChem
• ChEMBL
• ToxCast: over 1800 molecules tested against over 800 endpoints
BUT where are the ‘Big’ chemistry DBs?
But what about small data?
• In some cases it’s all we have
• In vivo data is not high throughput
• Small data builds networks
http://smalldatagroup.com/
The past
• 1996
• Data from low throughput
Drug-drug interaction studies
• E.g. Ki values with CYP 3A4
• A drug company might have
10s of values
• This data was used to build
3D QSAR, pharmacophores
JPET, 290: 429-438, 1999
Process                     Hydrophobic      Hydrogen bond    Hydrogen bond   Observed vs.
                            features (HPF)   acceptor (HBA)   donor (HBD)     predicted IC50 r
Acoustic mediated process   2                1                1               0.92
Tip-based process           0                2                1               0.80

Pharmacophore panels: Acoustic, Tip based
Generated with Discovery Studio (Accelrys)
Cyan = hydrophobic
Green = hydrogen bond acceptor
Purple = hydrogen bond donor
Each model shows most potent molecule mapping
How you dispense liquids may be important: insights from small data
PLoS ONE 8(5): e62325 (2013)
Ebola inhibitor
Pharmacophore
Ekins S, Freundlich JS and Coffee M
F1000Research 2014, 3:277
Docking FDA-approved
compounds in VP35
protein showing overlap
with ligand (yellow)
Proposed that amodiaquine, chloroquine, clomiphene and toremifene,
which are all active in vitro, may have
common features and bind a common
site / target
A common feature pharmacophore for FDA-approved drugs inhibiting the Ebola virus
The last 5 years to the present
• 2010
• Data from high
throughput screens at
Pfizer
• E.g. metabolic
stability data ~200K
compounds
• This data was used to
build machine
learning models
• 2015
• Could easily be
double this amount
Drug Metab Dispos, 38: 2083-2090, 2010
Ebola Machine Learning Models
Models (training set 868 compounds); all values are ROC

Model (validation)                         Ebola replication   Ebola pseudotype
                                           (actives = 20)      (actives = 41)
RP Forest (out-of-bag)                     0.70                0.85
RP Single Tree (5-fold cross-validation)   0.78                0.81
SVM (5-fold cross-validation)              0.73                0.76
Bayesian (5-fold cross-validation)         0.86                0.85
Bayesian (leave out 50% x 100)             0.86                0.82
Open Bayesian (5-fold cross-validation)    0.82                0.82
Ekins, Freundlich, Madrid and Clark
https://goo.gl/uG8K3P
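The ROC values reported for the Ebola models are areas under the ROC curve. As a minimal sketch of what that number measures, the function below computes it as the fraction of active/inactive pairs that a model ranks correctly. The function name and the toy scores are illustrative assumptions, not data or code from the models above.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    scores: predicted scores, higher means "more likely active".
    labels: 1 for active, 0 for inactive.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    # Count active/inactive pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical scores, not taken from the Ebola models).
example_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
example_labels = [1, 0, 1, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the 0.70 to 0.86 values should be read.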
Tuberculosis still kills 1.6-1.7m/yr (~1 every 8 seconds)
1/3 of the world’s population infected
streptomycin (1943)
para-aminosalicylic acid (1949)
isoniazid (1952)
pyrazinamide (1954)
cycloserine (1955)
ethambutol (1962)
rifampicin (1967)
Multidrug resistance in 4.3% of cases
Extensively drug resistant: increasing incidence
2 new drugs (bedaquiline, delamanid) in 40 yrs
Tuberculosis – a big disease
Big Data: Screening for New Tuberculosis Treatments
Tested:    >350,000 molecules           ~2M             2M     >300,000
Outcome:   >1500 active and non-toxic   published 177   100s   800
How many will become a new drug?
How do we learn from this big data?
TBDA screened over 1 million, 1 million 
more to go
TB Alliance + Japanese pharma screens
Over 8000 molecules with dose
response data for Mtb in CDD Public
from NIAID/SRI
https://app.collaborativedrug.com/register
Over 6 years analyzed in vitro data and built models
• Mtb screening molecule database/s
• High-throughput phenotypic Mtb screening
• Descriptors + bioactivity (+ cytotoxicity)
• Bayesian machine learning classification Mtb model
• Molecule database (e.g. GSK malaria actives) virtually scored using Bayesian models
• Top scoring molecules assayed for Mtb growth inhibition
• New bioactivity data may enhance models
Identify in vitro hits and test models
• 3 published prospective tests: ~750 molecules were tested in vitro, 198 actives were identified, >20% hit rate
• Multiple retrospective tests: 3-10 fold enrichment
Ekins et al., Pharm Res 31: 414-435, 2014
Ekins, et al., Tuberculosis 94; 162-169, 2014
Ekins, et al., PLOSONE 8; e63240, 2013
Ekins, et al., Chem Biol 20: 370-378, 2013
Ekins, et al., JCIM, 53: 3054−3063, 2013
Ekins and Freundlich, Pharm Res, 28, 1859-1869, 2011
Ekins et al., Mol BioSyst, 6: 840-851, 2010
Ekins, et al., Mol. Biosyst. 6, 2316-2324, 2010,
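The prospective numbers on this slide (about 750 molecules tested, 198 actives) can be checked with simple arithmetic. The sketch below computes the hit rate and shows how a fold enrichment is derived from it. The function names are illustrative, and the 5% baseline hit rate is a hypothetical value chosen only to show the calculation; it is not a figure from the slides.

```python
def hit_rate(n_actives, n_tested):
    """Fraction of tested molecules that were active."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_actives / n_tested


def fold_enrichment(selected_hit_rate, baseline_hit_rate):
    """How many times better the model-selected set is than the baseline."""
    if baseline_hit_rate <= 0:
        raise ValueError("baseline_hit_rate must be positive")
    return selected_hit_rate / baseline_hit_rate


# Numbers from the slide: ~750 molecules tested, 198 actives.
prospective = hit_rate(198, 750)  # 0.264, consistent with ">20% hit rate"

# Hypothetical baseline (assumption for illustration only).
assumed_baseline = 0.05
example_enrichment = fold_enrichment(prospective, assumed_baseline)
```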
5 active compounds vs Mtb in a few months
7 tested, 5 active (5/7, ~71% hit rate)
Ekins et al., Chem Biol 20: 370-378, 2013
1. Virtually screen the 13,533-member GSK antimalarial hit library
2. Bayesian model = SRI TAACF-CB2 dose response + cytotoxicity model
3. Top 46 commercially available compounds visually inspected
4. 7 compounds chosen for Mtb testing based on
- drug-likeness
- chemotype diversity
GSK #          Bayesian score   Mtb H37Rv MIC (µg/mL)   GSK reported % inhibition HepG2 @ 10 µM cmpd
TCMDC-123868   5.73             >32                     40
TCMDC-125802   5.63             0.0625                  5
TCMDC-124192   5.27             2.0                     4
TCMDC-124334   5.20             2.0                     4
TCMDC-123856   5.09             1.0                     83
TCMDC-123640   4.66             >32                     10
TCMDC-124922   4.55             1.0                     9
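The table can be read programmatically to confirm the "7 tested, 5 active" summary. In this sketch the rows are copied from the slide, and a compound is treated as active when a MIC was measured (that is, not reported as ">32"). That activity rule is an assumption made for illustration; it happens to reproduce the 5 actives stated on the slide.

```python
# Rows from the slide: (GSK ID, Bayesian score, Mtb H37Rv MIC in µg/mL,
# GSK reported % inhibition of HepG2 at 10 µM).
# A MIC reported as ">32" is stored as None.
ROWS = [
    ("TCMDC-123868", 5.73, None, 40),
    ("TCMDC-125802", 5.63, 0.0625, 5),
    ("TCMDC-124192", 5.27, 2.0, 4),
    ("TCMDC-124334", 5.20, 2.0, 4),
    ("TCMDC-123856", 5.09, 1.0, 83),
    ("TCMDC-123640", 4.66, None, 10),
    ("TCMDC-124922", 4.55, 1.0, 9),
]


def measured_actives(rows):
    """Keep rows with a measured MIC (assumed definition of 'active')."""
    return [row for row in rows if row[2] is not None]


active_rows = measured_actives(ROWS)
observed_hit_rate = len(active_rows) / len(ROWS)
# Lowest MIC means most potent against Mtb H37Rv.
most_potent = min(active_rows, key=lambda row: row[2])
```

Sorting the same rows by the HepG2 column would be the natural next step for flagging cytotoxic hits such as the compound with 83% inhibition.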
Filling out the triazine matrix using SARtable:
A new kind of map
Green = good activity, Red = bad; colored dots are predictions
No relationship between internal or external ROC and the
number of molecules in the training set?
PCA of combined
data and ARRA (red)
Ekins et al., J Chem Inf Model
54: 2157-2165 (2014)
Internal and leave out 50% x 100 ROC track each other
External ROC shows less correlation
Smaller models do just as well with external testing
~350,000
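"Leave out 50% x 100" is read here as 100 repeated random splits in which half of the training set is held out each time; that reading, the function name, and the fixed seed are assumptions. A minimal sketch of generating such splits:

```python
import random


def leave_out_half_splits(n_items, repeats=100, seed=0):
    """Yield (train_idx, held_out_idx) for repeated random 50% hold-outs."""
    if n_items < 2:
        raise ValueError("need at least two items to split")
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    indices = list(range(n_items))
    half = n_items // 2
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First half is held out, the remainder is used for training.
        yield shuffled[half:], shuffled[:half]


half_splits = list(leave_out_half_splits(10, repeats=100))
```

In use, a model would be rebuilt on each training half and scored on the held-out half, and the 100 ROC values averaged.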
What matters most >70 years of TB mouse in vivo data – Mind
the gap - 770 molecules
MIND THE TB GAP
Ekins et al.,
J Chem Inf Model 54: 1070-82, 2014
Ekins, Nuermberger & Freundlich
DDT 19: 1279-1282, 2014
In vivo Machine Learning Models
Ekins et al., J Chem Inf Model 54: 1070-82, 2014

                               RP Forest      RP Single Tree   SVM            Bayesian
ROC, 5-fold cross validation   0.75           0.71             0.77           0.73
External test set              3/11 (27.2%)   4/11 (36.4%)     7/11 (63.6%)   8/11 (72.7%)
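The 5-fold cross validation behind these ROC values splits the data into five parts, with each part used once as the test set. Below is a minimal sketch of the index bookkeeping only (no model is trained). The function name and seed are assumptions, and 770 is used simply because it is the dataset size quoted for the in vivo data.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) so every item is tested exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


cv_splits = list(k_fold_indices(770, k=5))
```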
How can we find the in vivo active compound?
We need a map.
>70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015 data
Uses Bayesian fingerprints
and clustering by similarity
Clark and Ekins - unpublished
Clustering in vivo
mouse TB data
Hex plot
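The slide does not say which similarity measure or clustering method produced the map. As one common way to cluster molecules by fingerprint similarity, the sketch below uses Tanimoto similarity on sets of fingerprint bits with a simple greedy "leader" clustering. Both choices, the threshold, and the toy fingerprints are assumptions, not the unpublished method.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two collections of fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def leader_cluster(fingerprints, threshold=0.5):
    """Greedy clustering: join the first cluster whose leader is similar."""
    clusters = []  # each cluster is a list of indices; index 0 is the leader
    for i, fp in enumerate(fingerprints):
        for cluster in clusters:
            if tanimoto(fp, fingerprints[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found: start a cluster
    return clusters


# Toy fingerprints (hypothetical bit indices).
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
toy_clusters = leader_cluster(toy_fps, threshold=0.5)
```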
>70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015
Clark and Ekins - unpublished
Clustering in vivo
mouse TB data
Triazine surrounded by
inactives
Issues
High Log P, poor solubility
How do we ‘increase drug discovery’?
• Make data and models more accessible
• Collaborate
• Share
– Create mobile apps
• Encourage engagement from non scientists
Models reside in papers
Not accessible… this is
undesirable
How do we share them?
How do we use them?
• CDD Vision
Uses Bayesian algorithm and FCFP_6 fingerprints
Bayesian models
Clark et al., J Cheminform 6:38 2014
Predictions for the InhA target: (a) the ROC curve with ECFP_6 and FCFP_6
fingerprints; (b) modified Bayesian estimators for active and inactive compounds;
(c) structures of selected binders.
For each listed target with at least two binders, it is first assumed that all of the
molecules in the collection that do not indicate this as one of their targets are
inactive.
In the app we used ECFP_6 fingerprints
Building Bayesian models for each target in TB Mobile
Clark et al., J Cheminform 6:38 2014
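To make the idea of a fingerprint-based Bayesian model concrete, here is a minimal sketch of a Laplacian-corrected naive Bayesian classifier, a formulation widely used with ECFP/FCFP fingerprints. Whether it matches the exact estimator used in TB Mobile is an assumption; the function names and the toy fingerprints are illustrative. Each feature contributes log((A + 1) / (T * P + 1)), where T is the number of molecules containing the feature, A the number of those that are active, and P the overall fraction of actives.

```python
import math
from collections import Counter


def train_bayesian(fingerprints, labels):
    """Return {feature: log contribution} for a Laplacian-corrected model."""
    total = len(labels)
    base_rate = sum(labels) / total  # overall fraction of actives (P)
    seen = Counter()    # T: molecules containing each feature
    active = Counter()  # A: active molecules containing each feature
    for fp, is_active in zip(fingerprints, labels):
        for bit in set(fp):
            seen[bit] += 1
            if is_active:
                active[bit] += 1
    return {
        bit: math.log((active[bit] + 1) / (seen[bit] * base_rate + 1))
        for bit in seen
    }


def bayesian_score(model, fp):
    """Sum the contributions of known features; unseen features add 0."""
    return sum(model.get(bit, 0.0) for bit in set(fp))


# Toy training set (hypothetical bit indices): two actives, two inactives.
toy_train = [{1, 2}, {1, 3}, {4, 5}, {4, 6}]
toy_labels = [1, 1, 0, 0]
toy_model = train_bayesian(toy_train, toy_labels)
```

Because the model is just a table of per-feature weights, it stays small, which fits the slides' point that such models are compact and portable.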
TB Mobile Vers. 2
Ekins et al., J Cheminform 5:13, 2013
Clark et al., J Cheminform 6:38 2014
Predict targets
Cluster molecules
http://goo.gl/vPOKS
http://goo.gl/iDJFR
Predictions for 2013-2015 in vivo
molecules
Bayesian models added to mobile apps: MMDS
Bayesian models added to mobile apps:
Approved drugs
Human Microsomal
Intrinsic clearance
Human protein binding Solubility pH 7.4
AZ dataset models >1000 molecules
Models from ChEMBL data
http://molsync.com/bayesian2
What do 2000 ChEMBL models
look like?
Folding bit size
Average
ROC
http://molsync.com/bayesian2
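The plot on this slide relates average ROC to fingerprint folding bit size. As a rough sketch of what folding means, the function below maps arbitrary integer feature hashes onto a fixed number of bit positions, so smaller sizes cause more collisions between distinct features. Restricting sizes to powers of two and folding with a bit mask are assumptions made for this illustration, and the hash values are hypothetical.

```python
def fold_fingerprint(hashes, n_bits):
    """Fold non-negative integer feature hashes into n_bits positions."""
    if n_bits <= 0 or n_bits & (n_bits - 1):
        raise ValueError("n_bits must be a positive power of two")
    # Masking with n_bits - 1 keeps only the low-order bits of each hash.
    return {h & (n_bits - 1) for h in hashes}


# Hypothetical feature hashes for one molecule.
raw_hashes = {5, 1029, 70000, 123456789}
folded_small = fold_fingerprint(raw_hashes, 1024)
folded_large = fold_fingerprint(raw_hashes, 32768)
```

Here hashes 5 and 1029 collide at 1024 bits but stay distinct at 32768 bits, which is the trade-off between model size and resolution that a folding-size scan explores.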
Bigger datasets and model
collections
• Profiling “big datasets” is going to be the norm.
• A recent study mined PubChem datasets for
compounds that have rat in vivo acute toxicity
data
• This could be used in other big data initiatives
like ToxCast (> 1000 compounds x 800 assays)
and Tox21 etc.
• Kinase screening data (1000s mols x 100s
assays)
• GPCR datasets etc (1000s mols x 100s assays)
Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal
Toxicants by Automatically Mining Public
Bioassay Data: A Big Data Approach for
Computational Toxicology. PLoS ONE 9(6):
e99863. doi:10.1371/journal.pone.0099863
• Data is at your fingertips instantly
• Labs add data to a massive corpus
of knowledge
• Instantly available to all
• Algorithms for mining, prediction
• Millions of models accessible
• Making decisions on experiments
needed and running them
• Data visualization, exploration is
real-time, updated
• Data follows you
Sean Ekins, a computational drug discovery consultant at Collaborations in
Chemistry in North Carolina, is much more skeptical. He notes pharma
companies have found hundreds of antimalaria compounds more potent
than TNP-470 and says that he is not convinced Eve can do QSAR. He wants
to see Eve go head-to-head with a real computational chemist. “Eve should
go back to the Garden of Eden and leave drug discovery to scientists who
know what they are doing,” Ekins says.
How close are we?
• Computers and models do not replace scientists
• A tool to help us sift through ideas quickly
• Many examples have led to leads
• Bigger data not needed for good models
• More data becoming public
• Can model ADME, bioactivity and more
• Collaboration and software are important
• Mobile apps have useful cheminformatics features -
aid anyone to do drug discovery
• Models are compact < 1MB and portable
• The age of model sharing is here
Conclusions
Wanted
• “Bigger” small molecule screening datasets
• Preferably > 500,000 – 1,000,000 molecules with data
• To test how machine learning algorithms scale
• Contact ekinssean@yahoo.com
Nadia Litterman, Krishna Dole and all at CDD, Megan Coffee, SRI, MM4TB and many
others …
Funding: Bill and Melinda Gates Foundation (Grant#49852), 1R41AI088893-01,
2R42AI088893-02, R43 LM011152-01, 9R44TR000942-02, 1R41AI108003-01,
1U19AI109713-01, MM4TB
Software: Biovia
Freundlich Lab



Editor's Notes

  • #15: You do not need big data to show fundamental observations