Benchmark set 1: SMILES valence model
Comparison 1: Compare readers on the same dataset, from CDK
Benchmark set 2: Aromatic SMILES for ChEMBL ring systems
Motivation
Comparison 2: Compare to Open Babel across all 11 datasets
Conclusions
www.nextmovesoftware.com
@nmsoftware
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
UK CB4 0EY
Can we agree on the structure represented by a SMILES string?
A benchmark dataset
Noel M. O’Boyle, John W. Mayfield, Roger A. Sayle
NextMove Software Ltd, Cambridge, UK
Our starting point is the axiom that a SMILES string represents a
particular molecule. The job of a SMILES reader is to faithfully
recreate that molecule.
We quantify to what extent different SMILES readers agree on the
molecule represented by a SMILES string. Our goal is to improve the
interoperability of SMILES strings by identifying ambiguities in the
specification and by working with toolkit developers to resolve bugs.
How many hydrogens are on the nitrogen in N(C)(C)(C)C? This
atom type (N4) was tested, along with 60 other atom types.
Disagreements with the specification [1] (and Dave Weininger’s own
code [2]) are listed below.
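The valence question above can be made concrete with a minimal sketch. It assumes the normal-valence table documented for the organic subset (B 3; C 4; N 3, 5; O 2; P 3, 5; S 2, 4, 6; halogens 1) and the rule that an unbracketed atom is filled with hydrogens up to the lowest normal valence that is at least the sum of its bond orders. The names `NORMAL_VALENCES` and `implicit_h_count` are illustrative and do not come from any of the toolkits tested.

```python
# Normal valences for the SMILES organic subset (assumed from the specification).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_h_count(element: str, bond_order_sum: int) -> int:
    """Hydrogens implied on an unbracketed atom with the given bond order sum."""
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            # Fill up to the lowest normal valence that accommodates the bonds.
            return valence - bond_order_sum
    # Bond order sum exceeds every normal valence: no hydrogens are added.
    return 0


# The N4 atom type from N(C)(C)(C)C: four single bonds on nitrogen.
n4_hydrogens = implicit_h_count("N", 4)
# The Cl2 atom type: two single bonds on chlorine.
cl2_hydrogens = implicit_h_count("Cl", 2)
```

Under these assumptions the nitrogen in N(C)(C)(C)C receives one hydrogen (valence 5), while Cl2 receives none, which is where several readers diverge.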
By comparing to a particular reader across all datasets, corner cases
and bugs can be identified. Here are results compared to Open Babel,
counting how many SMILES resulted in different hydrogen counts or
where one program gave an error but the other did not.
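A sketch of how such a count might be computed is shown here. It assumes each toolkit's output is stored as a mapping from SMILES string to a list of per-atom hydrogen counts, with `None` standing for an error, and it assumes that a SMILES rejected by both programs counts as agreement. The function name and the sample data are made up for illustration; they are not taken from the benchmark repository.

```python
from typing import Dict, List, Optional, Tuple

# Per-atom hydrogen counts, or None if the toolkit reported an error.
Result = Optional[List[int]]


def count_differences(reference: Dict[str, Result],
                      other: Dict[str, Result]) -> Tuple[int, int]:
    """Return (differences, differences ignoring errors) against a reference.

    Assumes both mappings cover the same SMILES strings.
    """
    differences = 0
    ignoring_errors = 0
    for smiles, ref in reference.items():
        oth = other[smiles]
        if ref is None and oth is None:
            continue  # both rejected the SMILES: treated as agreement
        if ref is None or oth is None:
            differences += 1  # only one program gave an error
        elif ref != oth:
            differences += 1  # both read it, hydrogen counts differ
            ignoring_errors += 1
    return differences, ignoring_errors


# Illustrative data only, not results from the benchmark.
reference = {"N(C)(C)(C)C": [1, 3, 3, 3, 3], "C1CC": None, "CCl": [3, 0]}
other = {"N(C)(C)(C)C": [0, 3, 3, 3, 3], "C1CC": None, "CCl": None}
totals = count_differences(reference, other)
```

Keeping the two counts separate makes it possible to tell genuine disagreements about structure apart from cases where one reader is simply stricter about what it accepts.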
As a sanity check, the test was repeated but with hydrogen count
specified, e.g. [NH](C)(C)(C)C. This is respected by all of the
toolkits. Interestingly, Indigo no longer rejects any of the atom types.
https://github.com/nextmovesoftware/smilesreading
The dataset contains 47463 unique ring systems derived from ChEMBL
23. Non-ring atoms were included if attached via double bonds, or via
single bonds but only if from a non-carbon ring atom.
For each of the 11 benchmark datasets, every toolkit tested was
required to:
1. read the SMILES
2. report any kekulization or parse errors
3. report the hydrogen count on each atom (if no error)
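The three steps above can be sketched as a small harness. This is only an outline under stated assumptions: each toolkit is wrapped in a callable that returns per-atom hydrogen counts and raises an exception on a parse or kekulization error. `run_benchmark` and `stub_reader` are hypothetical names, and the stub stands in for a real toolkit binding.

```python
from typing import Callable, Dict, List, Tuple


def run_benchmark(smiles_list: List[str],
                  reader: Callable[[str], List[int]]) -> Dict[str, Tuple[str, object]]:
    """Read each SMILES and record either its hydrogen counts or the error."""
    report: Dict[str, Tuple[str, object]] = {}
    for smiles in smiles_list:
        try:
            # Steps 1 and 3: read the SMILES, report hydrogen count per atom.
            report[smiles] = ("ok", reader(smiles))
        except Exception as exc:
            # Step 2: report any kekulization or parse error.
            report[smiles] = ("error", str(exc))
    return report


def stub_reader(smiles: str) -> List[int]:
    """Hypothetical stand-in for a toolkit: canned answers, errors otherwise."""
    canned = {"N(C)(C)(C)C": [1, 3, 3, 3, 3]}
    if smiles not in canned:
        raise ValueError("parse error: " + smiles)
    return canned[smiles]


# "C1CC" has an unclosed ring bond, so a reader is expected to reject it.
report = run_benchmark(["N(C)(C)(C)C", "C1CC"], stub_reader)
```

Recording errors as data rather than letting them stop the run means every toolkit yields one entry per SMILES, which keeps later comparisons straightforward.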
Toolkit          Atom types that disagree with the specification
Avalon           Cl2 Cl4 Br2 Br4 I2 I4
BIOVIA Draw      Cl2 Cl4 Br2 Br4 I2 I4
Cactvs           N4 P4 S3 S5 (or N4*)
CDK              (none)
CEX (Weininger)  (none)
ChemDoodle       (none)
ChemDraw         (none)
Indigo†          (none)
iwtoolkit        N4 Cl2 Cl3 Cl4 Cl5 Br2 Br3 Br4 I2 I4 (or P4 S3 S5*)
JChem            N4
KnowItAll        (none)
OEChem           (none)
Open Babel       (none)
OpenChemLib      N4 Cl2 Cl4 Br2 Br4 I2 I4
RDKit†           P6 I3 I4
* If the default options are modified
† Results exclude 17 atom types rejected by Indigo, and 19 rejected by RDKit
Acknowledgements and software versions
Thanks to the developers of many of the toolkits tested for interesting discussions, and
Matt Swain for providing results. The toolkits tested were: Avalon 1.2, BIOVIA Draw 2018,
Cactvs 3.4.6.25, CDK 2.1, CEX 1.3.2, ChemDoodle API 2.3.0, ChemDraw 16.0, Indigo
1.3.0b.r16, iwtoolkit Oct2017, JChem 17.23, KnowItAll 2018, OEChem Feb 2018, Open
Babel (dev) May2018, OpenChemLib 2018.5.0, RDKit 2018.03.1.
We believe this benchmark dataset to be a useful resource for the
improvement of SMILES interoperability. For all of the toolkits tested,
the results yield a treasure trove of corner cases and bugs. These
results have already led to changes to Cactvs, CDK, ChemDoodle,
iwtoolkit, KnowItAll, Open Babel and OpenChemLib.
We encourage any toolkit developers interested in improving SMILES
interoperability to get in touch, or just download the benchmark at
the URL above and try it out.
Source        Example ring systems
ChEMBL        O1ON1C    C1C(=O)C1    C1=NN=NS1
CDK           o1on1C    C1C(=O)C1    c1nnns1
OpenChemLib   CN1OO1    O=C1CC1      c1nnns1
RDKit         o1on1C    C1C(=O)C1    c1nnns1
+8 others
11 benchmark datasets
Toolkit        Different H count   Kekulization failure
Avalon         0                   1
BIOVIA Draw    0                   0
CDK            0                   0
ChemDoodle     13* (both columns combined)
ChemDraw       7                   25
Indigo†        456                 23
iwtoolkit      91                  69
JChem          5                   8
OEChem         0                   0
Open Babel     0                   0
OpenChemLib    9                   136
RDKit†         7                   1
* It is not possible to distinguish between kekulization failures and differences in
hydrogen count
† Results exclude 8 structures rejected by Indigo, and 15 by RDKit
Myth bust: Do differences in aromaticity models create problems for
SMILES readers? No – the problems are caused by kekulization
algorithms that are not sufficiently robust. [3]
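To illustrate why robustness matters, kekulization can be viewed as a matching problem: every aromatic atom that needs a double bond must be paired with exactly one neighbour. The sketch below is a simplified illustration, not the algorithm of any toolkit tested. It assumes the caller has already decided which atoms need a double bond (for c1nnns1, the c and the three n atoms but not the s), and it uses plain backtracking where production code would typically use a polynomial-time matching algorithm.

```python
from typing import Dict, FrozenSet, Optional, Set


def kekulize(adjacency: Dict[int, Set[int]]) -> Optional[Set[FrozenSet[int]]]:
    """Assign one double bond to every atom, or return None if impossible.

    adjacency maps each atom that needs a double bond to its neighbours
    that also need one.
    """
    atoms = sorted(adjacency)
    partner: Dict[int, int] = {}

    def solve() -> bool:
        unmatched = [a for a in atoms if a not in partner]
        if not unmatched:
            return True
        a = unmatched[0]
        for b in sorted(adjacency[a]):
            if b not in partner:
                partner[a], partner[b] = b, a
                if solve():
                    return True
                # Undo the choice and try the next neighbour (backtracking).
                del partner[a], partner[b]
        return False

    if not solve():
        return None
    return {frozenset((a, b)) for a, b in partner.items()}


# c1nnns1: atoms 0 (c), 1, 2, 3 (n) form a path; the sulfur is left out.
thiatriazole = kekulize({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}})
# A five-membered ring in which every atom demands a double bond cannot be kekulized.
five_ring = kekulize({i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)})
```

For the thiatriazole the assignment found corresponds to the Kekulé form C1=NN=NS1. A greedy approach that never revisits an earlier choice can report failure on structures that do have a valid assignment, which is the kind of weakness the benchmark exposes.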
Toolkit        Differences     Ignoring errors
Avalon         166             33
BIOVIA Draw    2837            21
CDK            205             24
ChemDoodle     4333            179
ChemDraw       1027            71
Indigo         6110 (6062*)    1769
iwtoolkit      6839            1179
JChem          318             43
OEChem         436             18
Open Babel     -               -
OpenChemLib    1367            89
RDKit          342 (235*)      50
* Differences ignoring errors about bad valence
If we inspect the CDK results, we find that SMILES with contradictory
stereobond symbols (e.g. C1CCCCN2/C(=N1)CN=C2) are accepted
by Open Babel (with warning) but rejected by CDK. Another case is
SMILES with stereobond symbols in aromatic rings; these are treated
by Open Babel as explicit single bonds but by CDK as implicit bonds.
Bibliography
1. Daylight Theory Manual. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
2. Weininger, D. Chemical EXchange 1.3.2. https://github.com/nextmovesoftware/CEX
3. O’Boyle, N.M.; Mayfield, J.W. https://www.slideshare.net/baoilleach/we-need-to-talk-about-kekulization-aromaticity-and-smiles
