6th RDKit UGM, Berlin, Germany, Thursday 21st September 2017
Recent improvements to the RDKit
Roger Sayle
NextMove Software, Cambridge, UK
Motivation: compound acquisition
• Given an existing screening collection of X
compounds, and with Y vendor compounds
available for purchase, how should I select
the next Z diverse compounds to buy?
• Typically, X is about 2M and Y is about 100M.
RDKit’s MaxMinPicker
• Picking diverse compounds from large sets, 2014/08
• Optimizing Diversity Picking in the RDKit, 2014/08
• M. Ashton, J. Barnard, P. Willett et al., “Identification
of Diverse Database Subsets using Property-based
and Fragment-based Molecular Descriptors”, Quant.
Struct.-Act. Relat., Vol. 21, pp. 598-604, 2002.
• R. Kennard and L. Stone, “Computer aided design of
experiments”, Technometrics, Vol. 11, No. 1, pp. 137-
148, 1969.
Conceptual algorithm
• If no compounds have been picked so far, choose the
first picked compound at random.
• Repeatedly select the compound furthest from its
nearest picked compound [hence the name
maximum-minimum distance].
• Continue until the desired number of picked
compounds has been selected (or the pool of
available compounds has been exhausted).
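The three steps above can be sketched in a few lines of Python. This is a naive illustration, not the RDKit implementation: the function name `maxmin_pick`, its parameters, and the use of a caller-supplied `dist` function are assumptions made for the example. It recomputes every candidate-to-pick distance on every round.

```python
import random

def maxmin_pick(points, n_picks, dist, first_picks=(), seed=0):
    """Naive MaxMin: return n_picks pool indices, each furthest from its nearest pick."""
    picked = list(first_picks)          # compounds already in the collection
    already = set(picked)
    pool = [i for i in range(len(points)) if i not in already]
    chosen = []
    # Step 1: with nothing picked so far, start from a random pool member.
    if not picked and pool and n_picks > 0:
        first = random.Random(seed).choice(pool)
        pool.remove(first)
        picked.append(first)
        chosen.append(first)
    # Steps 2 and 3: repeat until enough picks are made or the pool is exhausted.
    while len(chosen) < n_picks and pool:
        # For each candidate take the minimum distance to any pick,
        # then choose the candidate whose minimum is largest.
        best = max(pool, key=lambda c: min(dist(points[c], points[p]) for p in picked))
        pool.remove(best)
        picked.append(best)
        chosen.append(best)
    return chosen
```

For example, with points `[0, 1, 2, 10]` on a line, absolute difference as the distance, and index 0 already picked, the next two picks are indices 3 and then 2.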
Selection visualization
Image Credits: Antoine Stevens, the ProspectR package on GitHub
Pros vs. cons
• Optimal picking is (NP-)Hard.
• Density vs. Diversity
– With biased data sets, random sampling follows density,
where MaxMin optimizes coverage.
– Picking != Clustering.
• Worst-case NN-1 bounds
– It’s possible to bound (or at least estimate a bound on)
the worst-case distance to the nearest neighbor.
• Fraction of Data Set to be Sampled
• Scaling Performance of Algorithm
Distance matrices are a bad idea
• Naïvely, one approach is to provide a picking
algorithm with a full distance matrix.
• If X compounds are picked already, and Y is the initial
pool to pick from, this requires (X+Y-1)*(X+Y)/2 time
and space.
• One goal is to keep memory requirements O(X+Y).
• An intelligent algorithm should be able to avoid
calculating any given distance more than once.
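To put the (X+Y-1)*(X+Y)/2 figure in perspective, here is a small back-of-the-envelope calculation using the typical sizes from the motivation slide (X about 2M, Y about 100M). The helper name `pair_count` and the 4 bytes per stored distance are assumptions for illustration only.

```python
def pair_count(x, y):
    """Number of distinct pairs among x picked and y pool compounds."""
    n = x + y
    return n * (n - 1) // 2

X, Y = 2_000_000, 100_000_000
pairs = pair_count(X, Y)
bytes_needed = 4 * pairs            # assumption: one 32-bit float per distance
print(f"{pairs:.3e} distances, about {bytes_needed / 1e15:.1f} PB")
```

That is roughly 5.2e15 distances, which is why a full matrix is ruled out and memory linear in X+Y is the goal.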
Taylor-Butina vs. MaxMin picking
• At a comparable distance threshold, MaxMin picks
are also a Leader (Tabu) clustering.
• As MaxMin picking doesn’t require a distance matrix
(NN list) it is significantly cheaper than Taylor-Butina.
• The distance bound for a given coverage is
discovered rather than specified (by trial and error).
– Taylor, JCICS, 1995, 35, pp. 59-67.
– Butina, JCICS, 1999, 39, pp. 747-750.
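One way to read the link between MaxMin picks and a leader-style clustering is to keep picking until the maximum-minimum distance drops below a threshold, then assign every compound to its nearest pick. The sketch below does exactly that, under assumptions: `maxmin_until`, its arguments, and the stopping rule are illustrative choices, not taken from the talk or from the RDKit.

```python
def maxmin_until(points, dist, threshold, first=0):
    """Pick by MaxMin until every point lies within `threshold` of some pick."""
    picks = [first]
    # nearest[i] = (distance to the nearest pick, index of that pick)
    nearest = [(dist(p, points[first]), first) for p in points]
    while True:
        far = max(range(len(points)), key=lambda i: nearest[i][0])
        d_far = nearest[far][0]
        # Stop once the worst-covered point is close enough (or nothing is left).
        if d_far == 0 or d_far < threshold:
            break
        picks.append(far)
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i][0]:
                nearest[i] = (d, far)
    # Each pick acts as the "leader" of the points nearest to it.
    clusters = {p: [] for p in picks}
    for i, (_, leader) in enumerate(nearest):
        clusters[leader].append(i)
    return picks, clusters
```

Running it the other way round, with a fixed number of picks, and reading off the final maximum-minimum distance gives the "discovered" distance bound mentioned above.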
Artificial intelligence (MINI-MAX)
Los Alamos Chess (6x6 board)
White has 17 possible moves.
The 11 that don’t check, lose.
Five checks, lose the queen.
[Figure: game tree with MAX and MIN levels]
Alpha cut-offs allow us to prune the search tree.
Max-min picking
[Figure: worked example showing a table of distances between the Picks and the Candidate Pool, with rows labelled Minimums and Maximum]
Max-min picking
[Figure: the same worked example of Picks and Candidate Pool distances, with rows labelled "< Bounds" and Maximum]
RDKit implementation
• For each candidate we track its bound, the number
of picks used to calculate this bound, and a pointer
to the next pool candidate (a singly linked list).
• Memory usage is O(poolsize), less than INT_LIST.
• No need for a distance matrix or distmat cache.
• Linked list preserves tie splitting/skips picked items.
• The same data structure is (ab)used as a visit array in
an initial pass to remove firstPicks from the pool.
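The bookkeeping described above (a bound per candidate plus the number of picks that bound reflects) can be sketched as follows. This is a simplified illustration, not the RDKit source: the name `lazy_maxmin_pick` is hypothetical, an ordinary Python list stands in for the singly linked list, and at least one first pick is assumed. The `comparisons` counter is included only to make the saving visible.

```python
def lazy_maxmin_pick(points, n_picks, dist, first_picks):
    """MaxMin with alpha cut-off pruning and bounds preserved across picks."""
    picks = list(first_picks)
    in_picks = set(picks)
    pool = [i for i in range(len(points)) if i not in in_picks]
    bound = {c: float("inf") for c in pool}  # upper bound on min distance to picks
    seen = {c: 0 for c in pool}              # number of picks the bound reflects
    comparisons = 0
    chosen = []
    while len(chosen) < n_picks and pool:
        best, best_val = None, -1.0
        for c in pool:
            if bound[c] <= best_val:
                continue  # cannot beat the current best: no comparison needed
            # Tighten the bound against picks not yet compared, but stop
            # as soon as the candidate can no longer win (alpha cut-off).
            while seen[c] < len(picks) and bound[c] > best_val:
                d = dist(points[c], points[picks[seen[c]]])
                comparisons += 1
                seen[c] += 1
                if d < bound[c]:
                    bound[c] = d
            if seen[c] == len(picks) and bound[c] > best_val:
                best, best_val = c, bound[c]  # bound is now the exact minimum
        pool.remove(best)
        picks.append(best)
        chosen.append(best)
    return chosen, comparisons
```

Because a bound can only shrink as more picks are added, a stale bound that is already below the current best remains a valid reason to skip a candidate, and no candidate-pick distance is ever computed twice.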
Performance improvement
• Using Andrew Dalke’s data set of 12386
benzodiazepines, using the first one thousand as the
existing screening library, select the next 18 most
diverse molecules from the remaining set (of 11386).
– Original RDKit Implementation:
• 224,688,273 FP comparisons 96.30s
– Pruning using alpha cut-off:
• 16,069,573 FP comparisons 6.79s (14x)
– Preserving bounds across picks:
• 1,047,982 FP comparisons 0.46s (209x)
– Timings on Dell laptop, using 2048 bit Morgan radius 2 FPs.
Significant similarity
• Different similarity measures require different
significance thresholds to distinguish random hits.
[Figure: two panels of significance thresholds. First panel: 90%: 0.528, 95%: 0.574, 99%: 0.656. Second panel: 90%: 0.245, 95%: 0.271, 99%: 0.323.]
Thresholds for “random” in fingerprints the RDKit supports. 2013/10
In this talk I’ll ignore the influence of query and database composition on score significance.
Results: data set sampling
• ChEMBL 23 (1727053 compounds)
– 90%: 15107 compounds [5.45B FP cmps]
– 95%: 24430 compounds [9.82B FP cmps]
– 99%: 51463 compounds [23.9B FP cmps]
• eMolecules170601 (14328534 compounds)
– 90%: 19049 compounds [45.7B FP cmps]
– 95%: 32780 compounds [86.5B FP cmps]
– 99%: 80135 compounds [247B FP cmps]
AstraZeneca vs. Bayer
AstraZeneca vs. Bayer
Screening library enhancement
• Selecting 1K compounds for purchase from
eMolecules (14M) to enhance ChEMBL (1.7M).
– Reading eMolecules: 4780s
– Reading ChEMBL: 821s
– Generating FPs: 1456s
– MaxMinPicker: 42773s [80B FP cmps]
• Conclusion: Large-scale diversity selection can be run
overnight on a single CPU core.
• Selecting the first 18 compounds takes only 399s
[715M FP cmps].
Additional considerations #1
• Rule 1: It’s better to filter compounds for desirability,
physical properties, Lipinski, price before picking.
– Filtering is O(N), Picking isn’t.
• With library screening, it’s often preferable to have
hits with neighbors rather than singletons, to confirm
true positives, or provide initial QSAR; either:
– picking should be performed on the pool of compounds
that have neighbors, or
– each pick should be confirmed to have a neighbor as it is chosen.
Additional considerations #2
• If required, it is possible to sample some areas of
chemical space (kinase inhibitors or fragments) with
higher density than others.
• For those that want to do fingerprint comparison
faster than RDKit, Andrew Dalke’s ChemFP is worth a
look.
• One possible refinement is to de-duplicate identical
fingerprints (efficiently done using a hash table).
• Finally, a better similarity measure should produce
better results (SmallWorld sales pitch goes here)
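The de-duplication refinement mentioned above can be sketched with a Python dict acting as the hash table. The sketch assumes each fingerprint is available as a hashable value such as `bytes`; the function name `dedupe_fingerprints` is hypothetical.

```python
def dedupe_fingerprints(fps):
    """Group identical fingerprints; return one representative index per group."""
    groups = {}                      # fingerprint -> indices that share it
    for idx, fp in enumerate(fps):
        groups.setdefault(fp, []).append(idx)
    representatives = [members[0] for members in groups.values()]
    return representatives, groups
```

Picking would then run over `representatives` only, with `groups` kept to map each pick back to all compounds sharing that fingerprint.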
Implementing full Kennard-Stone
• The original Kennard-Stone algorithm requires the
first two picks to be the furthest apart in a data set.
• Traditionally, this has required O(N²) time.
• A breakthrough in theoretical computer science
since 2013 has changed this assumption:
– Pilu Crescenzi, Roberto Grossi, Michel Habib, Leonardo Lanzi, Andrea Marino, “On computing the
diameter of real-world undirected graphs”, Theoretical Computer Science, Vol. 514, pp. 84-95, 2013.
– Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter Kosters, Andrea Marino and Frank Takes,
“On the Solvability of the Six Degrees of Kevin Bacon Game: A Faster Graph Diameter and Radius
Computation Method”, 2014.
– Michele Borassi, Pierluigi Crescenzi, Michel Habib, Walter A. Kosters, Andrea Marino and Frank W.
Takes, “Fast diameter and radius BFS-based computation in (weakly connected) real-world graphs
with an application to the six degrees of separation games”, Theoretical Computer Science, Vol. 586,
pp. 59-80, 2015.
Bounded vertex eccentricity
• The eccentricity of a vertex, v, is the greatest distance to any other vertex.
• The radius of a graph is the minimum eccentricity.
• The diameter of a graph is the maximum eccentricity.
Image Credit: Tapiocozzo, Wikipedia page on “Centrality”.
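The three definitions above translate directly into a breadth-first search per vertex. The sketch below is the brute-force approach on an unweighted, connected graph given as an adjacency dict; the function names are illustrative, and this is the baseline that the bounded methods aim to beat, not one of those methods.

```python
from collections import deque

def eccentricity(graph, v):
    """Greatest shortest-path distance from v to any other vertex (BFS)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def radius_and_diameter(graph):
    """Minimum and maximum eccentricity over all vertices (graph assumed connected)."""
    ecc = [eccentricity(graph, v) for v in graph]
    return min(ecc), max(ecc)

# A path 0-1-2-3-4: the middle vertex has eccentricity 2, the ends have 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(radius_and_diameter(path))
```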
SumSweep results
• Finding the two furthest-apart atoms among the 327 heavy atoms in
the protein crambin (1CRN)
– Brute Force: 53301 comparisons
– SUMSWEEP: 6636 comparisons [k=21]
• Finding the two furthest-apart compounds (using Hamming distance on
2K MFP2 FPs) among the 250251 compounds in the NCI August
2000 data set.
– Brute Force: 31,312,656,375 comparisons
– SUMSWEEP: 43,278,372 comparisons [k=173]
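For orientation, the sketch below contrasts the brute-force search, which needs N(N-1)/2 comparisons, with the much simpler double-sweep heuristic. This is not SumSweep: a double sweep uses about 2N comparisons and only guarantees a lower bound on the largest distance, whereas the figures above are for SumSweep. Function names are illustrative.

```python
def furthest_from(points, i, dist):
    """Index of the point furthest from points[i], and that distance."""
    j = max(range(len(points)), key=lambda k: dist(points[i], points[k]))
    return j, dist(points[i], points[j])

def double_sweep(points, dist, start=0):
    """Two sweeps: furthest point a from start, then furthest point b from a."""
    a, _ = furthest_from(points, start, dist)
    b, d_ab = furthest_from(points, a, dist)
    return a, b, d_ab               # d_ab is a lower bound on the true maximum

def brute_force_diameter(points, dist):
    """Exact largest pairwise distance using all N(N-1)/2 comparisons."""
    return max(dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])
```

On points that lie along a line the double sweep happens to be exact; in general its result has to be checked against upper bounds, which is the part the cited algorithms provide.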
Conclusions
• A little computer science can make compound
purchasing decisions a whole lot faster.
Acknowledgements
• The RDKit crew
– Greg Landrum and Brian Kelley
• The NextMove Software crew
– John Mayfield and Noel O’Boyle
• Industrial Inspiration
– Darren Green, GSK
– Roman Affentranger, Roche
– Pat Walters, Relay Therapeutics
– Andrew Dalke, Dalke Scientific.
