Dealing with ‘exotic’ similarity
metrics
How to set up a (ChemAxon-powered)
Similarity-driven Virtual Screening server…
Dragos Horvath, dhorvath@unistra.fr
UMR 7140 CNRS – Université de Strasbourg
Introduction & Definitions
• Similarity-based Virtual Screening (SVS):
– Search, in a database of candidates m, for analogues of a query compound M with the desired properties, hoping that the « similarity principle » magic will operate.
• Molecular Similarity S(M,m):
– a distance (metric) between the two Descriptor Space (DS) points 𝐷(𝑀), 𝐷(𝑚) – let us call these 𝐷 and 𝑑, for simplicity.
• Similarity Radius s defines « how similar is similar »
– Delimits a sphere in descriptor space around M, expected to contain a minimum of inactive but a maximum of active candidates m.
• Virtual Hits – aka True & False « Positives » (TP,FP):
– Compounds m with S(M,m)<s
Compound Sets
• For server calibration:
– Candidate database: 165 ChEMBL ligand sets, each with >50 molecules of reported pKi values with respect to the 165 associated receptors & enzymes (targets T).
– Queries of T: M_T^1, M_T^2 … M_T^i, i = 1..Q_T, composed of the top 1/5 (max 100) actives on T plus 1/5 (max 100) binders of medium potency; queries can be classified by pharmacophore complexity (number of populated FPT1 triplets).
– 10,000 randomly picked commercial molecules from ZINC,
assumed to be inactive “decoys”.
• Operational database:
– 1.5 M commercial compounds, from various sources
– The above « reference » molecules, included for annotation purposes
Descriptor Spaces
All are Feature Counts: D_i(M) = non-negative integer population level of « feature » i (a substructure or a pharmacophore triplet) in molecule M.
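In practice, each molecule can thus be represented as a sparse map from feature keys to counts. A minimal Python sketch (not ChemAxon code), with purely hypothetical feature names standing in for substructure or pharmacophore-triplet labels:

```python
# Minimal sketch: a feature-count descriptor as a sparse mapping
# feature -> non-negative integer count. The feature keys below are
# hypothetical stand-ins for substructure / pharmacophore-triplet labels.
from collections import Counter

D = Counter({"arom_ring": 2, "amide": 1, "HBA-HBD-Hyd@4-6-8": 1})  # query M
d = Counter({"arom_ring": 1, "amide": 1, "halogen": 2})             # candidate m

union = set(D) | set(d)   # features populated in M or m
N_OR = len(union)         # N_OR(m, M)
print(N_OR, sorted(union))
```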
Dissimilarity Scores…
• Based on the comparison of descriptor vectors 𝐷, 𝑑
$NORM(M) = \sum_{i=1}^{N(M)} D_i^2$

$AND(m,M) = \sum_{i=1}^{N_{OR}(m,M)} D_i \times d_i$

$EXC(M,m) = \sum_{i \,|\, d_i = 0} D_i^2$

where N(M) is the number of features populated in M, N_OR(m,M) the number of features populated in m or M, and the companion counters N_AND(m,M) and N_EXC(M,m) count the features entering the AND and EXC sums, respectively.
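As a sketch (continuing the Python example above, with descriptors as sparse count dictionaries), these auxiliary sums could be coded as:

```python
def NORM(D):
    """Sum of squared counts over the features populated in one molecule."""
    return sum(v * v for v in D.values())

def AND(D, d):
    """Sum of count products over features populated in both molecules."""
    return sum(D[i] * d[i] for i in set(D) & set(d))

def EXC(D, d):
    """Sum of squared counts of the features present in the first molecule
    but absent from the second (d_i = 0)."""
    return sum(D[i] ** 2 for i in D if d.get(i, 0) == 0)
```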
Euclidean & Related…
$E(m,M) = \sqrt{\sum_{i=1}^{N_{OR}(m,M)} (D_i - d_i)^2}$

$R(m,M) = \sqrt{\frac{1}{N_{OR}(m,M)} \sum_{i=1}^{N_{OR}(m,M)} (D_i - d_i)^2}$

$A(m,M) = \frac{1}{N_{OR}(m,M)} \sum_{i=1}^{N_{OR}(m,M)} |D_i - d_i|$

$RW(m,M) = R(m,M) \times \frac{N_{EXC}(m,M) + N_{EXC}(M,m)}{N_{OR}(m,M)}$

$AW(m,M) = A(m,M) \times \frac{N_{EXC}(m,M) + N_{EXC}(M,m)}{N_{OR}(m,M)}$
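A sketch of the Euclidean family over sparse count dictionaries, assuming E is the plain Euclidean distance over the union of populated features, R and A its per-feature root-mean-square and mean-absolute variants, and N_EXC the number of features exclusive to one of the two molecules (readings inferred from the formulas above):

```python
import math

def _union(D, d):
    return set(D) | set(d)

def E(D, d):
    """Euclidean distance over the union of populated features."""
    return math.sqrt(sum((D.get(i, 0) - d.get(i, 0)) ** 2 for i in _union(D, d)))

def R(D, d):
    """Root-mean-square difference per populated feature."""
    u = _union(D, d)
    return math.sqrt(sum((D.get(i, 0) - d.get(i, 0)) ** 2 for i in u) / len(u))

def A(D, d):
    """Mean absolute difference per populated feature."""
    u = _union(D, d)
    return sum(abs(D.get(i, 0) - d.get(i, 0)) for i in u) / len(u)

def _exclusive_fraction(D, d):
    """(N_EXC(m,M) + N_EXC(M,m)) / N_OR(m,M): share of the union features
    that are populated in only one of the two molecules."""
    u = _union(D, d)
    return sum(1 for i in u if D.get(i, 0) == 0 or d.get(i, 0) == 0) / len(u)

def RW(D, d):
    return R(D, d) * _exclusive_fraction(D, d)

def AW(D, d):
    return A(D, d) * _exclusive_fraction(D, d)
```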
(A)Symmetric Correlation Scores –
Tanimoto & Tversky
$T(M,m) = 1 - \frac{AND(M,m)}{NORM(M) + NORM(m) - AND(M,m)}$

$Tv(M,m,\alpha) = 1 - \frac{AND(M,m)}{\alpha\,EXC(M,m) + (1-\alpha)\,EXC(m,M) + AND(M,m)}$
Situations where:
(a) the candidate m misses a feature seen in the active M, and
(b) it contains some novel feature not seen in M
may be distinguished! At α > 0.5, case (a) is penalized more heavily than the symmetric case (b).
A rough guess of α should suffice! Three implementations of Tv are considered (sketched below):
• Tv+ (α = 0.9)
• Tv (α = 0.7)
• Tv- (α = 0.3)
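Reusing the NORM, AND and EXC helpers sketched earlier, the Tanimoto and Tversky dissimilarities, together with the three α flavors, might look like this:

```python
def tanimoto(D, d):
    """T(M,m) = 1 - AND / (NORM(M) + NORM(m) - AND)."""
    a = AND(D, d)
    return 1.0 - a / (NORM(D) + NORM(d) - a)

def tversky(D, d, alpha):
    """Tv(M,m,alpha): alpha > 0.5 penalizes query features missing in the
    candidate more heavily than extra features of the candidate."""
    a = AND(D, d)
    return 1.0 - a / (alpha * EXC(D, d) + (1.0 - alpha) * EXC(d, D) + a)

# The three flavors benchmarked in the study:
def tv_plus(D, d):  return tversky(D, d, 0.9)   # Tv+
def tv_mid(D, d):   return tversky(D, d, 0.7)   # Tv
def tv_minus(D, d): return tversky(D, d, 0.3)   # Tv-
```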
2. Fine, but « how similar is similar »?
• You may believe in the dogma « Tanimoto > 0.85 » (T < 0.15)
– But the Bible says nothing about the other metrics, which are less subject to religious fervor.
• Alternatively, try to infer reasonable choices of
similarity radii for each Chemical Space (CS – the
combination of Descriptor Space & Similarity score)
– For each query, on every target, compute s* corresponding
to the « optimal » SVS scenario.
– This also makes it possible to measure & benchmark SVS success with respect to its Operational Premises (CS, nature of the target, complexity of the query, etc.).
A basic SVS Optimality Criterion: Ω
Every scored pair (M, m) is cross-classified by its similarity score against the cutoff s and by its activity difference against a tolerance λ:

                 Λ(M,m) ≤ λ                  Λ(M,m) > λ
S(M,m) ≤ s       True Positives (TP)         False Positives (FP)
S(M,m) > s       False (?) Negatives (FN)    True Negatives (TN)

$\Omega(s) = 1 - \frac{N_{FP}(s) + N_{FN}(s)}{N_{FP}^{E}(s) + N_{FN}^{E}(s)}$

where $N_{FP}(s)$ and $N_{FN}(s)$ count the false positives and false negatives at cutoff s, while $N_{FP}^{E}(s)$ and $N_{FN}^{E}(s)$ are the counts expected for a random selection of the same size. Ω approaches 1.0 for a clean separation of actives and drops towards 0 for a random-like one.

[Plot: Ω as a function of the similarity cutoff s, with the 1.0 level marked.]

Activity (profile) differences Λ(m,M):

$\Lambda(M,m) = \begin{cases} 0 & \text{if } |pK_i(M) - pK_i(m)| < 0.5 \\ 1 & \text{if } |pK_i(M) - pK_i(m)| > 3.0 \\ \dfrac{|pK_i(M) - pK_i(m)| - 0.5}{2.5} & \text{otherwise} \end{cases}$
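A minimal Python sketch of Λ and of one plausible reading of Ω, assuming fuzzy (Λ-weighted) false-positive/false-negative counts compared to their expectations for a random selection of the same size; the exact weighting used in the study may differ:

```python
def activity_difference(pKi_M, pKi_m):
    """Fuzzy activity difference Lambda(M,m) in [0, 1]."""
    delta = abs(pKi_M - pKi_m)
    if delta < 0.5:
        return 0.0
    if delta > 3.0:
        return 1.0
    return (delta - 0.5) / 2.5

def omega(pairs, s):
    """Optimality criterion at similarity cutoff s.

    pairs: list of (S(M,m), Lambda(M,m)) tuples for all scored pairs.
    """
    selected = [lam for score, lam in pairs if score <= s]
    rejected = [lam for score, lam in pairs if score > s]
    n_fp = sum(selected)                        # selected despite differing activity
    n_fn = sum(1.0 - lam for lam in rejected)   # rejected despite similar activity
    # Expected FP/FN counts for a random pick of the same size
    mean_lam = sum(lam for _, lam in pairs) / len(pairs)
    n_fp_exp = len(selected) * mean_lam
    n_fn_exp = len(rejected) * (1.0 - mean_lam)
    if n_fp_exp + n_fn_exp == 0.0:
        return 0.0
    return 1.0 - (n_fp + n_fn) / (n_fp_exp + n_fn_exp)
```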
The Ascertained Optimality Excess X

$X(s) = \frac{\Omega(s) - \langle \Omega_{rand}(s) \rangle}{\sqrt{Var[\Omega_{rand}(s)]}}$

where $\langle \Omega_{rand}(s) \rangle$ and $Var[\Omega_{rand}(s)]$ are the mean and variance of Ω obtained with random S values, at the same fraction of compound pairs selected at cutoff s.

[Plot: Ω versus the fraction of compound pairs selected at cutoff s, for meaningful S values compared to random S values; X measures the gap in units of √Var(Ω_rand).]
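A sketch of X(s), estimating ⟨Ω_rand⟩ and Var(Ω_rand) by reshuffling the similarity scores (so that the same number of pairs is selected at cutoff s); the randomization protocol actually used for calibration may differ:

```python
import random
import statistics

def optimality_excess(pairs, s, n_shuffles=100, seed=42):
    """X(s) = (Omega(s) - <Omega_rand(s)>) / sqrt(Var[Omega_rand(s)])."""
    observed = omega(pairs, s)
    scores = [score for score, _ in pairs]
    lams = [lam for _, lam in pairs]
    rng = random.Random(seed)
    random_omegas = []
    for _ in range(n_shuffles):
        shuffled = scores[:]
        rng.shuffle(shuffled)   # break any score/activity relationship
        random_omegas.append(omega(list(zip(shuffled, lams)), s))
    mu = statistics.mean(random_omegas)
    sigma = statistics.pstdev(random_omegas)
    return (observed - mu) / sigma if sigma > 0.0 else 0.0
```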
Workflow

ForEach Target T
    Set database db = set of tested ligands (known pKi) + decoy set (pKi = 0);
    ForEach Query M of T
        ForEach DescriptorSpace D
            ForEach SimilarityScore S
                # Start current SVS experiment defined by Target, Query, Descriptors & Similarity Score
                ForEach m != M in db
                    Calculate S(M,m)|D;
                EndLoop(m)
                Scan over s → X(s) and return s* such that X(s*) is maximal;
                Classify SVS(T,M,D,S) wrt X(s*) as « failed », « acceptable », « good » or « excellent »;
            EndLoop(S)
        EndLoop(D)
    EndLoop(M)
EndLoop(T)
Analyze Success Rates & s* distributions in terms of the various Operational Premises (nature of T, complexity of M, choice of D, of S, or of D-S combinations)
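Ported to Python, the inner SVS experiment (one target, query, descriptor space and similarity score) could look roughly as follows, reusing activity_difference() and optimality_excess() from the sketches above; the thresholds mapping X(s*) onto the failed/acceptable/good/excellent labels are not given on the slide and are left out:

```python
def run_svs_experiment(D_M, pKi_M, db, score, cutoffs):
    """One SVS experiment: query descriptor D_M (activity pKi_M), candidate
    database db = list of (descriptor_dict, pKi) with decoys at pKi = 0,
    a dissimilarity function `score`, and a grid of cutoffs s.
    Returns the optimal radius s* and the corresponding X(s*)."""
    pairs = [(score(D_M, d_m), activity_difference(pKi_M, pKi_m))
             for d_m, pKi_m in db]
    # Scan over s and keep the cutoff maximizing the optimality excess X(s)
    s_star, x_star = max(((s, optimality_excess(pairs, s)) for s in cutoffs),
                         key=lambda item: item[1])
    return s_star, x_star

# Hypothetical usage:
# s_star, x_star = run_svs_experiment(D, 7.2, db, tanimoto,
#                                     [i / 50 for i in range(1, 31)])
```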
Insights: (1) – So much for dogmas!
[Histogram: percentage of Tanimoto-based queries reaching « Good » optimality level, as a function of the optimal radius s* (0.04 to 0.6), for the FPT1 and treeSY03 descriptor spaces.]
Use the s* distribution to « teach » the web server how to rank prospective SVS hits: Top Hits (0), Good Hits (1), Average Hits (2), Acceptable Hits (3), Are these Hits? (4), Ignore…
Insights: (2) – Tversky at α > 0.5: an excellent similarity scoring scheme.
[Bar chart: relative « market share » of each metric (Tv+, Tv, T, RW, AW, Tv-, E, A, R) – the fraction of SVS runs based on the shown metric, out of all SVS experiments having reached the acceptable, good or excellent success level.]
Tv+ may pick actives that are more
complex than queries (NK1 example)
Insights: (3) – Trends with respect to
target classes could be evidenced…
[Bar chart: relative « market share » of each metric (Tv+, Tv, T, RW, AW, Tv-, E, A, R) within target classes – the fraction of SVS runs, per class (all targets, kinases, monoamine GPCRs, other GPCRs), based on the shown metric, out of all SVS experiments having reached the « good » success level.]
Insights: (4) – when the query compound is
complex, the metric matters less
[Bar chart: relative « market share » of each metric (Tv+, Tv, T, RW, AW, Tv-, E, A, R) within query-complexity classes (all queries, high pharmacophore complexity, low pharmacophore complexity), out of all SVS experiments having reached the « good » success level.]
Some conclusions
• The study has highlighted many interesting aspects
– Intrinsic usefulness of Tversky scores biased towards the query-feature-loss penalty: α = 0.9…0.7 will do!
– Other target-, query complexity-, query activity- and descriptor space-dependent trends of SVS success
– Some inevitable sources of bias, showing that not even ChEMBL is large/diverse enough to cover it all…
• Main message: use this protocol – or a related one – to calibrate web servers, rather than sticking to well-studied metrics and descriptors for which « Universal » similarity cutoffs are believed to hold.
• Try infochim.u-strasbg.fr/webserv/VSEngine.html – to our
knowledge, the only public SVS server to support atypical, but
powerful metrics coupled to chemically relevant, pH-sensitive
descriptor spaces… all while exploiting the power of ChemAxon
tools!