ACS Fall 2017, Washington, D.C.
comparing cahn-ingold-prelog
rule implementations:
the need for an open cip
John Mayfield, Daniel Lowe, Roger Sayle
“The Cahn–Ingold–Prelog (CIP) sequence rules … are a
standard process used in organic chemistry to completely and
unequivocally name a stereoisomer of a molecule.” - Wikipedia
If you are not naming stereoisomers
you (probably) don’t want to use CIP
Tools can give different answers.
What can we do about it?
NUMBER OF STEREOCENTRES PER ENTRY
[Figure: stereocentre count per entry (0 to 9+) as % of dataset, one bar per dataset]
eMolecules (2017-Jun-01): 14 million total
PubChem Substance: 234 million total
PubChem Compound (Aug 17): 93 million total
ChEMBL 23: 1.7 million total
ChEBI 154: 95 thousand total
Many chemists are taught the CIP rules during their education,
and the rules appear deceptively simple
‣ Simple cases are easy for a human (and computers)
‣ Complex cases are hard for a human (and computers)
IUPAC Blue Book (2013) extends the recommendations but is
incomplete (and contains some mistakes)
The Sequence RULES
(in essence)
Rule 1
a. Higher atomic number precedes lower
b. An atom node duplicated closer to the root ranks higher than one
duplicated further
Rule 2 Higher atomic mass number precedes lower
Rule 3 Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4
a. Chiral stereogenic units precede pseudoasymmetric stereogenic units
and these precede nonstereogenic units (R = S > r = s > nst)
b. When two ligands have different descriptor pairs, the one with the first
chosen like descriptor pairs has priority over the one with a corresponding
unlike descriptor pairs
c. r precedes s
Rule 5 An atom or group with descriptor R has priority over its enantiomorph S
[Figure: 2-butanol (atoms numbered 1 to 6) and its digraph rooted at atom 3, with spheres (i) and (ii) marked]
Example
1. In sphere (i), C2 and C5 are tied: O > C5 = C2 > H
2. In sphere (ii), C2 and C5 are split: (C,H,H) > (H,H,H),
and therefore C2 > C5
3. The priority is 4, 2, 5, 6 and the configuration is S
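The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of Rule 1a only, assuming an acyclic molecule with explicit hydrogens; the hydrogen atom numbers (7 to 15) and the function names are hypothetical, and comparing each sphere as a sorted set is a simplification of the full hierarchical comparison.

```python
from collections import defaultdict

ATOMIC_NUMBER = {"H": 1, "C": 6, "O": 8}

# 2-butanol with the slide's numbering for atoms 1-6 (3 is the stereocentre,
# 4 is the oxygen, 6 is the hydrogen on the stereocentre).
# Hydrogens 7-15 are numbered arbitrarily for this sketch.
ELEMENTS = {1: "C", 2: "C", 3: "C", 4: "O", 5: "C", 6: "H"}
ELEMENTS.update({i: "H" for i in range(7, 16)})
BONDS = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6),
         (1, 7), (1, 8), (1, 9), (2, 10), (2, 11),
         (4, 12), (5, 13), (5, 14), (5, 15)]

ADJ = defaultdict(list)
for a, b in BONDS:
    ADJ[a].append(b)
    ADJ[b].append(a)


def branch_spheres(root, first):
    """Atomic numbers of one branch, sphere by sphere, highest first.

    Only valid for acyclic structures: no duplicate nodes are created.
    """
    spheres = []
    frontier = [(root, first)]  # (parent, atom) pairs in the current sphere
    while frontier:
        spheres.append(sorted((ATOMIC_NUMBER[ELEMENTS[atom]]
                               for _, atom in frontier), reverse=True))
        frontier = [(atom, nbr) for parent, atom in frontier
                    for nbr in ADJ[atom] if nbr != parent]
    return spheres


def rank_neighbours(root):
    """Neighbours of root, highest priority first (Rule 1a only)."""
    return sorted(ADJ[root], key=lambda n: branch_spheres(root, n),
                  reverse=True)


print(rank_neighbours(3))  # [4, 2, 5, 6]
```

Python compares the nested lists lexicographically, which is what makes the sphere-by-sphere tie-break fall out of a plain `sorted` call here.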
DIGRAPHS
• Rules are applied to hierarchical directed acyclic graphs
(digraphs)
• Comparison proceeds in “spheres” out from the root of the
graph
• Combinatorial explosions for some structures
[Figure: a cyclic structure (atoms numbered 1 to 7) and its digraph, with duplicated nodes shown in parentheses]
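How ring closures lead to duplicated nodes, and how quickly the digraph grows, can be illustrated with a small sketch. It assumes a heavy-atom connectivity table only and ignores bond orders (duplicates for double and triple bonds are not modelled); `Node`, `build_digraph`, and the complete-graph test inputs are hypothetical and chosen for illustration rather than taken from any of the tools compared here.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Node:
    atom: int
    duplicate: bool = False
    children: List["Node"] = field(default_factory=list)


def build_digraph(adj: Dict[int, List[int]], atom: int,
                  parent: Optional[int] = None,
                  path: FrozenSet[int] = frozenset()) -> Node:
    """Expand the whole digraph below `atom`.

    An atom that is already on the path from the root is added as a
    terminal duplicate node instead of being expanded again.
    """
    node = Node(atom)
    path = path | {atom}
    for nbr in adj[atom]:
        if nbr == parent:
            continue
        if nbr in path:
            node.children.append(Node(nbr, duplicate=True))  # ring closure
        else:
            node.children.append(build_digraph(adj, nbr, atom, path))
    return node


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


def complete_graph(n: int) -> Dict[int, List[int]]:
    """Toy connectivity in which every atom is bonded to every other."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


for n in (3, 4, 5):
    print(n, count_nodes(build_digraph(complete_graph(n), 0)))
```

A three-membered ring already needs 7 nodes for 3 atoms, and the four-atom toy cage needs 34, which is why eager expansion becomes impractical for fused or caged ring systems.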
PSEUDO-ASYMMETRY
There is some confusion about lower-case r and s
• Assigned only when Rule 5 has been used
• Not an indication of non-constitutional stereochemistry
Why? The reflection is superimposable on the original.
AUXILIARY DESCRIPTORS
Auxiliary descriptors are used to split ties in symmetric
molecules by labelling the asymmetric digraphs
1. Tie in initial digraph
2. Calculate auxiliary descriptors
3. R > S (Rule 5), giving 3:r
Picture: May, J. W. (2015). Cheminformatics for genome-scale metabolic
reconstructions (doctoral thesis).
mancude ring handling
P-92.1.4.4 Nomenclature of Organic Chemistry: IUPAC Recommendations and
Preferred Names 2013
Different Kekulé forms can result in different digraphs
Handled using fractional atomic numbers
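As a worked illustration of fractional atomic numbers: on our reading of P-92.1.4.4, the duplicate atom attached to a mancude ring atom is given the mean of the atomic numbers it would have across the possible Kekulé forms. The helper below is a hypothetical sketch of that arithmetic only (the name `duplicate_atomic_number` is ours); exact fractions avoid floating-point ties.

```python
from fractions import Fraction
from typing import Iterable


def duplicate_atomic_number(partner_atomic_numbers: Iterable[int]) -> Fraction:
    """Mean atomic number of the double-bond partners over all Kekulé forms.

    `partner_atomic_numbers` lists, for each Kekulé form considered, the
    atomic number of the atom that the ring atom is double bonded to.
    """
    zs = list(partner_atomic_numbers)
    return Fraction(sum(zs), len(zs))


# Double bonded to C in one form and to N in the other: (6 + 7) / 2
print(duplicate_atomic_number([6, 7]))     # 13/2
# Double bonded to C in two forms and to N in a third: (6 + 6 + 7) / 3
print(duplicate_atomic_number([6, 6, 7]))  # 19/3
```

Because the result no longer depends on which Kekulé form was drawn, the digraph is the same for every input form.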
MAJORITY HANDLED BY RULE 1a
ChEBI ChEMBL eMolecules PubChem Compound PubChem Substance
Rule 1a 281K 99.6% 1.8M 98.6% 2.4M 97.0% 53.5M 100.0% 93.1M 98.7%
Rule 1b 4 1 164 255
Rule 2 14 3,565 6,789
Rule 3 29 3 441 36 45
Rule 4a 122 126 273 4 12,770
Rule 4b 563 0.2% 4,037 0.2% 3,188 0.1% 125K 0.1%
Rule 4c 19 558
Rule 5 285 0.1% 23.4K 1.2% 69K 2.8% 15 1.1M 1.2%
Total 282K 1.9M 2.4M 53.5M 94.3M
Count is the number of stereocentres; values of zero and percentages close to zero have been removed to reduce complexity
distance from root
[Figure: % of stereocentres labelled at each sphere (1 to 10) for chebi_154, chembl_23, eMolecules170601, pubchem, and pubchem_substance]
Majority (but not all)
stereocentres labelled
within first few spheres
Best to generate digraph
lazily as required
Some digraphs are far
too big to generate fully
(e.g. fullerenes)
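Lazy generation can be sketched with a Python generator that produces one sphere at a time, so that comparison stops as soon as two branches differ. This is an illustrative sketch under simplifying assumptions: heavy atoms only, acyclic connectivity (a ring would need the duplicate-node handling described earlier), and a missing sphere treated as empty. The names `branch_spheres`, `first_difference`, and the `log` argument are hypothetical.

```python
from itertools import zip_longest


def make_adj(bonds):
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def branch_spheres(adj, z, root, first, log):
    """Yield the sorted atomic numbers of each sphere only when asked.

    Acyclic structures only. `log` records every sphere actually expanded.
    """
    frontier = [(root, first)]  # (parent, atom) pairs
    depth = 0
    while frontier:
        depth += 1
        log.append((first, depth))
        yield sorted((z[atom] for _, atom in frontier), reverse=True)
        frontier = [(atom, nbr) for parent, atom in frontier
                    for nbr in adj[atom] if nbr != parent]


def first_difference(adj, z, root, a, b, log):
    """Sphere (1-based) at which branches a and b first differ, else None."""
    pairs = zip_longest(branch_spheres(adj, z, root, a, log),
                        branch_spheres(adj, z, root, b, log), fillvalue=[])
    for depth, (sphere_a, sphere_b) in enumerate(pairs, start=1):
        if sphere_a != sphere_b:
            return depth
    return None


# Root 0 carries an oxygen (atom 1) and a ten-carbon chain (atoms 2..11).
adj = make_adj([(0, 1), (0, 2)] + [(i, i + 1) for i in range(2, 11)])
z = {i: 6 for i in range(12)}
z[1] = 8
log = []
print(first_difference(adj, z, 0, 1, 2, log), len(log))  # 1 2
```

Only one sphere of each branch is expanded in this example; the remaining nine carbons of the chain are never visited.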
comparison
Rule 1A
I II
Centres 2.0 R R
JMol 14.20.3 R R
ACD/ChemSketch 14.05beta R R
Balloon 1.6.5beta R R
KnowItAll ChemWindow 2018 R R
ChemDraw 16.0 R R
BIOVIA Draw 2017 R R
MarvinSketch 17.17 R -
Indigo 1.3.0Beta.r16 - R
RDKit 2017.03.03 S R
DataWarrior 4.6.0 R R
CACTVS (NCI Resolver Aug 17) R R
OPSIN 2.3.1 R R
LexiChem (OEChem) 20170613 R R
ChemDoodle 7.0.2 R R
CDK 2.0 - R
JUMBO 6 R -
[Figure: structures I and II]
Rule 1B
Centres 2.0 R
JMol 14.20.3 R
ACD/ChemSketch 14.05beta R
Balloon 1.6.5beta R
KnowItAll ChemWindow 2018 R
ChemDraw 16.0 R
BIOVIA Draw 2017 -
MarvinSketch 17.17 -
Indigo 1.3.0Beta.r16 -
RDKit 2017.03.03 R
DataWarrior 4.6.0 -
CACTVS (NCI Resolver Aug 17) -
OPSIN 2.3.1 R
LexiChem (OEChem) 20170613 -
ChemDoodle 7.0.2 -
CDK 2.0 -
JUMBO 6 -
Rule 2
Jan 2015 Aug 2017
Centres R R
JMol n/a R
ACD/ChemSketch R R
Balloon 1.6.5beta n/a R
KnowItAll ChemWindow n/a R
ChemDraw S S
Accelrys/BIOVIA Draw S R
MarvinSketch S S
Indigo R R
RDKit S S
DataWarrior S S
CACTVS S R
OPSIN R R
LexiChem (OEChem) S R
ChemDoodle S n/a
CDK S S
JUMBO - -
R or S? Let’s Vote https://nextmovesoftware.com/blog/2015/01/21/r-or-s-lets-vote/
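The "vote" in the table above can be tallied directly. The sketch below simply transcribes the Rule 2 rows (tool names shortened, `n/a` and `-` entries ignored) and counts R against S for each date; the dictionary and function names are ours.

```python
from collections import Counter

# (Jan 2015, Aug 2017) answers transcribed from the Rule 2 table above.
RULE2_RESULTS = {
    "Centres": ("R", "R"),
    "JMol": ("n/a", "R"),
    "ACD/ChemSketch": ("R", "R"),
    "Balloon": ("n/a", "R"),
    "KnowItAll ChemWindow": ("n/a", "R"),
    "ChemDraw": ("S", "S"),
    "Accelrys/BIOVIA Draw": ("S", "R"),
    "MarvinSketch": ("S", "S"),
    "Indigo": ("R", "R"),
    "RDKit": ("S", "S"),
    "DataWarrior": ("S", "S"),
    "CACTVS": ("S", "R"),
    "OPSIN": ("R", "R"),
    "LexiChem (OEChem)": ("S", "R"),
    "ChemDoodle": ("S", "n/a"),
    "CDK": ("S", "S"),
    "JUMBO": ("-", "-"),
}


def tally(column: int) -> Counter:
    """Count R and S answers in one column (0 = Jan 2015, 1 = Aug 2017)."""
    return Counter(answers[column] for answers in RULE2_RESULTS.values()
                   if answers[column] in ("R", "S"))


print("Jan 2015:", dict(tally(0)))  # R: 4, S: 9
print("Aug 2017:", dict(tally(1)))  # R: 10, S: 5
```

The majority answer differs between the two dates, which suggests that counting votes across tools is not a reliable substitute for a reference implementation.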
Rule 4b
[Figure: test structure with its stereocentres labelled S, S, S, and R]
Centres 2.0 R
JMol 14.20.3 R
ACD/ChemSketch 14.05beta R
Balloon 1.6.5beta R
KnowItAll ChemWindow 2018 R
ChemDraw 16.0 R
BIOVIA Draw 2017 R
MarvinSketch 17.17 R
Indigo 1.3.0Beta.r16 R
RDKit 2017.03.03 S
DataWarrior 4.6.0 S
CACTVS (NCI Resolver Aug 17) S
OPSIN 2.3.1 -
LexiChem (OEChem) 20170613 -
ChemDoodle 7.0.2 s
CDK 2.0 -
JUMBO 6 -
MANCUDE RINGS
I II
Centres 2.0 R R
JMol 14.20.3 R R
ACD/ChemSketch 14.05beta R R
Balloon 1.6.5beta R R
KnowItAll ChemWindow 2018 R R
ChemDraw 16.0 R R
BIOVIA Draw 2017 R R
MarvinSketch 17.17 R R
Indigo 1.3.0Beta.r16 S R
RDKit 2017.03.03 R R
DataWarrior 4.6.0 R R
CACTVS (NCI Resolver Aug 17) S R
OPSIN 2.3.1 S R
LexiChem (OEChem) 20170613 S R
ChemDoodle 7.0.2 S R
CDK 2.0 S R
JUMBO 6 S S
[Figure: structures I and II]
AUX DESCRIPTORS
Centres 2.0 R
JMol 14.20.3 R
ACD/ChemSketch 14.05beta R
Balloon 1.6.5beta R
KnowItAll ChemWindow 2018 R
ChemDraw 16.0 R
BIOVIA Draw 2017 R
MarvinSketch 17.17 -
Indigo 1.3.0Beta.r16 -
RDKit 2017.03.03 -
DataWarrior 4.6.0 -
CACTVS (NCI Resolver Aug 17) -
OPSIN 2.3.1 -
LexiChem (OEChem) 20170613 -
ChemDoodle 7.0.2 -
CDK 2.0 -
JUMBO 6 -
hard to implement A
MarvinSketch 17.17
[Figure: two depictions of one structure whose stereocentre labels differ, (S),(S) versus (S),(R)]
Turning aromaticity on flips stereochemistry (e.g. CHEBI:16063)
[Figure: one structure with three different atom numberings (1 to 12), each yielding a different set of stereocentre labels]
Labels depend on input order
hard to implement B
ChemDraw 16.0
[Figure: series of structures with (CH2)n spacers of increasing length between the stereocentre and the distinguishing groups]
Becomes undefined at distance ≥ 16
open cip?
Why?
• Provide a blessed implementation that can be
used directly or compared against
• Toolkit agnostic library to facilitate downstream
integration
“FIX-CIP” COLLABORATION
Robert Hanson (JMol), John Mayfield (Centres)
Mikko Vainio (Balloon), Andrey Yerin (ACD/Name),
Sophia Gillian Musacchio (St. Olaf College)
Goals
• Discuss and resolve software inconsistencies
• Generate a comprehensive test set based on
Blue Book structures
• Recommend rule amendments and additions
Publication in preparation
should you use CIP?
Yes
Systematic nomenclature
Human conversation (if no pen is
handy)
Probably not (better algorithms exist)
Unique labelling (see right)
Compute “conversation”
Finding/cleaning stereocentres
No
Relative comparison, e.g.
substructure search
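For relative comparison, a local parity check avoids CIP entirely. The sketch below assumes each tetrahedral centre is stored as an ordered list of its four neighbours under one fixed winding convention (as neighbour-order based formats do); two listings describe the same configuration exactly when one is an even permutation of the other. The function names and the `mapping` argument (query atom to target atom) are hypothetical.

```python
from typing import Dict, Sequence


def permutation_parity(order: Sequence[int], reference: Sequence[int]) -> int:
    """+1 if `order` is an even permutation of `reference`, -1 if odd."""
    index = {atom: i for i, atom in enumerate(reference)}
    perm = [index[atom] for atom in order]
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return 1 if inversions % 2 == 0 else -1


def same_configuration(query_nbrs: Sequence[str],
                       target_nbrs: Sequence[int],
                       mapping: Dict[str, int]) -> bool:
    """Compare a query centre with a target centre under an atom mapping.

    Both neighbour lists must use the same winding convention.
    """
    mapped = [mapping[q] for q in query_nbrs]
    return permutation_parity(mapped, target_nbrs) == 1


mapping = {"a": 10, "b": 11, "c": 12, "d": 13}
query = ["a", "b", "c", "d"]
print(same_configuration(query, [10, 11, 12, 13], mapping))  # True
print(same_configuration(query, [11, 10, 12, 13], mapping))  # False
print(same_configuration(query, [11, 12, 10, 13], mapping))  # True
```

No priorities are computed, so the result cannot change when a remote substituent alters the CIP ranking, which is the failure mode that makes R/S labels unsuitable for substructure matching.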
[Figure: two structures with every stereocentre labelled (R) or (S)]
acknowledgements
SciMix Poster
Robert Hanson (JMol)
Mikko Vainio (Balloon)
Andrey Yerin (ACD/Name)
Sophia Gillian Musacchio (St. Olaf
College)
Karl Nedwed (Bio-Rad)
Noel O’Boyle (NextMove Software)
Shuzhe Wang (NextMove Software)
John Mayfield, Daniel Lowe and Roger Sayle
NextMove Software Ltd, Cambridge, UK.
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
UK CB4 0EY
www.nextmovesoftware.com
Acknowledgements
We thank Robert Hanson, Andrey Yerin, Mikko Vainio, and Sophia Gillian Musacchio for initiating and
participating in the “Fix CIP” collaboration and the many in-depth technical discussions that
have led to improvements in the tools; Karl Nedwed for providing KnowItAll results; Philip
Skinner for providing ChemDraw licenses; and Noel O’Boyle for feedback and suggestions.
the need for open-cip
Introduction
The Cahn-Ingold-Prelog (CIP) priority rules rank atoms around a stereogenic unit to
assign a stereo-descriptor that is invariant to atom order and layout, for example R (right) or
S (left) for tetrahedral atoms.
A directed acyclic graph (digraph) is constructed for each stereogenic unit and the out
edges from the root node compared and ranked according to eight sequence rules[1]. Each
rule is applied exhaustively and tested on the entire digraph before applying the next rule[2].
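The hierarchical application of the rules can be sketched as follows: each rule is applied to every remaining tie before the next rule is consulted, and ranking stops at the first rule that leaves no ties. This is a toy model in which each ligand is a single atom described by an atomic number and a mass number; real implementations compare whole digraph branches, and the names `rank_ligands`, `LIGANDS`, and `RULES` are hypothetical.

```python
from itertools import groupby


def rank_ligands(ligands, rules):
    """Return (ranking, name of the last rule needed), highest priority first.

    `rules` is an ordered list of (name, key function) pairs. A later rule
    only refines groups left tied by all earlier rules. The rule name is
    None if ties remain after every rule.
    """
    groups = [list(ligands)]  # groups of tied ligands, highest first
    for rule_name, key in rules:
        refined = []
        for group in groups:
            ordered = sorted(group, key=key, reverse=True)
            refined.extend(list(run) for _, run in groupby(ordered, key=key))
        groups = refined
        if all(len(group) == 1 for group in groups):
            return [group[0] for group in groups], rule_name
    return [lig for group in groups for lig in group], None


LIGANDS = [
    {"name": "H", "z": 1, "mass": 1},
    {"name": "D", "z": 1, "mass": 2},
    {"name": "C", "z": 6, "mass": 12},
    {"name": "O", "z": 8, "mass": 16},
]
RULES = [("1a", lambda lig: lig["z"]), ("2", lambda lig: lig["mass"])]

ranking, rule = rank_ligands(LIGANDS, RULES)
print([lig["name"] for lig in ranking], rule)  # ['O', 'C', 'D', 'H'] 2
```

Recording which rule finally broke the last tie is also how per-rule statistics of the kind reported in this poster can be gathered.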
Acknowledgements
Results
Bibliography
1. P-92.1.3, Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013.
2. Paulina Mata. The CIP System Again: Respecting Hierarchies Is Always a Must. J. Chem. Inf. Comput. Sci., 1999, 39 (6).
Conclusion
The CIP sequence rules provide a standard way for chemists to effectively describe the
configurations of most stereogenic units. However, beyond simple cases the complexity of
the rules necessitates the use of software as an aid to naming configurations. The results
demonstrate that, even then, software implementations do not all agree on the configuration.
Through the results presented here and the on-going effort of the Fix CIP collaboration,
software should aim to converge upon consistent stereochemistry naming. An Open CIP
software tool could provide “blessed” stereochemistry configuration names and provide a
standard algorithm implementation for other vendors to integrate or adapt.
Comparing Cahn-Ingold-Prelog Rule Implementations
Rule 1
a. Higher atomic number precedes lower
b. An atom node duplicated closer to the root ranks higher than one duplicated further
Rule 2 Higher atomic mass number precedes lower
Rule 3 Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4
a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these
precede nonstereogenic units (R = S > r = s > nst)
b. When two ligands have different descriptor pairs, the one with the first chosen like
descriptor pairs has priority over the one with a corresponding unlike descriptor
pairs
c. r precedes s
Rule 5 An atom or group with descriptor R has priority over its enantiomorph S
Stereochemistry in Databases
[Figure: number of stereogenic units per record (0 to 9+) as % of dataset, for chebi_154, chembl_23, pubchem, pubchem_substance, and eMolecules170601]
eMolecules (June 2017): 14 million records
PubChem Substance: 234 million records
PubChem Compound (Aug 2017): 93 million records
ChEMBL 23: 1.7 million records
ChEBI 154: 95 thousand records
The number of defined stereogenic units per molecule varies between databases.
The application of Rule 1a to the digraph for 2-butanol ranks the out edges from the root
as 4 > 2 > 5 > 6, giving the label S (4 → 2 → 5 runs anticlockwise when looking towards 6).
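Once the four ligands are ranked, the clockwise or anticlockwise test can be done numerically. The sketch below assumes 3D coordinates for the ligands in priority order a > b > c > d and uses the sign of a scalar triple product; the function names and the example coordinates are hypothetical and not taken from any particular toolkit.

```python
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


def sub(u: Vec, v: Vec) -> Vec:
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def cross(u: Vec, v: Vec) -> Vec:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u: Vec, v: Vec) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def descriptor(a: Vec, b: Vec, c: Vec, d: Vec) -> Optional[str]:
    """R or S for ligand positions given in priority order a > b > c > d.

    With d pointing away from the viewer, a -> b -> c clockwise is R,
    which corresponds to a negative signed volume in this formulation.
    Returns None if the four points are coplanar.
    """
    volume = dot(sub(a, d), cross(sub(b, d), sub(c, d)))
    if volume < 0:
        return "R"
    if volume > 0:
        return "S"
    return None


# Viewer on the +z axis, lowest-priority ligand d pointing away (-z).
a = (0.0, 1.0, 0.0)     # 12 o'clock
b = (0.87, -0.5, 0.0)   # 4 o'clock
c = (-0.87, -0.5, 0.0)  # 8 o'clock
d = (0.0, 0.0, -1.0)
print(descriptor(a, b, c, d))  # R
print(descriptor(a, c, b, d))  # S
```

Swapping any two ligands flips the sign of the volume, which matches the anticlockwise arrangement giving S in the 2-butanol example.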
ChEBI ChEMBL eMolecules PubChem Compound¹ PubChem Substance
Rule 1a 281K 99.6% 1.8M 98.6% 2.4M 97.0% 53.5M 100.0% 93.1M 98.7%
Rule 1b 4 1 164 255
Rule 2 14 3,565 6,789
Rule 3 29 3 441 36 45
Rule 4a 122 126 273 4 12,770
Rule 4b 563 0.2% 4,037 0.2% 3,188 0.1% 125K 0.1%
Rule 4c 19 558
Rule 5 285 0.1% 23.4K 1.2% 69K 2.8% 15 1.1M 1.2%
Total 282K 1.9M 2.4M 53.5M 94.3M
The majority of stereogenic units are constitutionally asymmetric and can be ranked using
Rule 1a. However, in some datasets the number of stereogenic units requiring Rules 4b
and 5 can be significant.
Tool                          I    II   III  IV   V    VI   VII  VIII IX   X    XIa  XIb  XII  XIII
Centres 2.0                   R    R    R    R    R    R    R    R    R    r    R    R    r    R
JMol 14.20.3                  R    R    R    R    R    R    R    R    R    r    R    R    r    R
ACD/ChemSketch 14.05beta      R    R    R    R    R    R    R    R    R    r    R    R    r    R
Balloon 1.6.5beta             R    R    R    R    R    R    R    R    R    r    R    R    r    R
KnowItAll ChemWindow 2018     R    R    R    R    R    R    R    R    R    r    R    R    r    R⁵
ChemDraw 16.0                 R    R    R    R    S    R    R    R    R    r    R    R    r    R
BIOVIA Draw 2017              R    R    R    -    R    R    R    R    R    -¹   R    R    -¹   R
MarvinSketch 17.17            R    -    -    -    S    R    -    R    -    r    R    R    r    -
Indigo 1.3.0Beta.r16          -²   R    -    -    R    -    R    R    R    r    S    R    -    -
RDKit 2017.03.03              S    R    S    R    S    R    R    S    R    R    R    R    -    -
DataWarrior 4.6.0             R    R    R    -    S    R    R    S    R    R    R³   R    -    -
CACTVS (NCI Resolver Aug 17)  R    R    S    -    S⁴   R    R    S    R    R    S    R    -    -
OPSIN 2.3.1                   R    R    R    R    R    -    -    -    -    -    S    R    -    -
LexiChem (OEChem) 20170613    R    R    -    -    R    -    -    -    -    -    S    R    -    -
ChemDoodle 7.0.2              R    R    -    -    S    -    -    s    -    r    S    R    -    -
CDK 2.0                       -    R    R⁵   -    S    -    -    -    -    -    S    R    -    -
JUMBO 6                       R    -    S    -    -    -    -    -    -    -    S    S    -    -
Column groups: Constitutional (Rules 1a, 1b, 2); Geometrical + Topographical (Rules 3, 4a, 4b, 4c, 5); Special (Mancude, Aux Descriptors)
1. Pseudoasymmetric r/s labels not displayed but must be
calculated due to answers given for IX and XIII
2. Runtime error occurs
3. Impossible to test as different Kekulé forms are normalised
4. R in CACTVS since Feb 2015, NCI resolver is old version
5. Other descriptor is assigned differently
A set of fourteen structures was collected to identify differences between software
implementations. The structures were selected to cover all the sequence rules and their
applications to special cases.
Eight sequence rules (in essence)
Fix CIP Collaboration
Since submitting this work for presentation, the developers of Centres, JMol,
ACD/ChemSketch, and Balloon have begun a collaboration. We are in the process of
submitting for publication an extended in-depth validation set and proposing sequence rule
refinements and additions where they are required.
¹ As part of PubChem Compound’s processing, non-constitutional stereochemistry is
removed: for example, the nine stereoisomers of inositol are all represented by CID 892.
Atoms connected by double and triple bonds as well as ring closures result in
duplicated nodes in the digraph. In the structure below atoms 5 and 6 appear twice and
atom 1 (the root) appears three times.
Due to this duplication, complex ring systems can generate exponentially large digraphs
that are not computationally tractable. Further complexity in digraphs is caused by the use
of fractional atomic numbers in mancude ring-systems and assignment of auxiliary
descriptors for applying Rules 3-5.
[Figure: a cyclic structure (atoms numbered 1 to 7) and its digraph, with duplicated nodes shown in parentheses]
[Figure: 2-butanol (atoms numbered 1 to 6) and its digraph]


CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an open cip

  • 1. ACS Fall 2017, Washington, D.C. comparing cahn-ingold-prelog rule implementations: the need for an open cip John Mayfield, Daniel Lowe, Roger Sayle
  • 2. “The Cahn–Ingold–Prelog (CIP) sequence rules … are a standard process used in organic chemistry to completely and unequivocally name a stereoisomer of a molecule.” - Wikipedia
  • 3. “The Cahn–Ingold–Prelog (CIP) sequence rules … are a standard process used in organic chemistry to completely and unequivocally name a stereoisomer of a molecule.” - Wikipedia If you are not naming stereoisomers you (probably) don’t want to use CIP Tools can give different answers, What can we do about it?
  • 4. NUMBER OF STEREOCENTRES PER ENTRY 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 % of Dataset Count 0 1 2 3 4 5 6 7 8 9 eMolecules 2017-Jun-01 PubChem Substance PubChem Compound (Aug 17) ChEMBL 23 ChEBI 154 + 14 million total 234 million total 93 million total 1.7 million total 95 thousand total
  • 5. Many chemists are taught the CIP rules during their education and is deceptively simple ‣ Simple cases are easy for a human (and computers) ‣ Complex cases are hard for a human (and computers) IUPAC Blue Book (2013) extends recommendations but incomplete (and some mistakes)
  • 6. The Sequence RULES (in essence) Rule 1 a. Higher atomic number precedes lower b. An atom node duplicated closer to the root ranks higher than one duplicated further Rule 2 Higher atomic mass number precedes lower Rule 3 Z precedes E and this precedes nonstereogenic (nst) double bonds Rule 4 a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst) b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pairs c. r precedes s Rule 5 An atom or group with descriptor R has priority over its enantiomorph S
  • 7. O H H H H H H H H H 321 5 4 6 1 2 3 5 6 4 H Example 1. In the sphere (i) C2 and C5 are tied O > C5 = C2 > H 2. In the sphere (ii) C2 and C5 are split C,H,H > H,H,H and therefore C2 > C5 3. The priority is 4, 2, 5, 6 and the configuration is S (i) (ii)
  • 8. DIGRAPHS • Rules are applied to hierarchal directed acyclic graphs (digraphs) • Comparison proceeds in “spheres” out from the root of the graph • Combinatorial explosions for some structures H OH H H H H H H H H H 1 7 6 5 (1) (1) 65234 O O 3 4 2 1 6 5 7 7
  • 9. PSEUDO-ASYMMETRY Some confusion of lower case r and s • Assigned only when Rule 5 has been used • Not indication of non-constitutional Why? Reflection is superimposable:
  • 10. AUXILIARY DESCRIPTORS Auxiliary descriptors are used to split ties by symmetric molecules by labelling the asymmetric digraphs Tie in initial digraph Calculate auxiliary descriptors R > S (Rule 5) 3:r Picture: May, J. W. (2015). Cheminformatics for genome-scale metabolic reconstructions (doctoral thesis).
  • 11. mancude ring handling P-92.1.4.4 Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013 Kekulé forms can result if different digraphs Handled using fractional atomic numbers
  • 12. The Sequence RULES (in essence) Rule 1 a. Higher atomic number precedes lower b. An atom node duplicated closer to the root ranks higher than one duplicated further Rule 2 Higher atomic mass number precedes lower Rule 3 Z precedes E and this precedes nonstereogenic (nst) double bonds Rule 4 a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst) b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pairs c. r precedes s Rule 5 An atom or group with descriptor R has priority over its enantiomorph S
  • 13. ChEBI ChEMBL eMolecules PubChem Compound PubChem Substance Rule 1a 281K 99.6% 1.8M 98.6% 2.4M 97.0% 53.5M 100.0% 93.1M 98.7% Rule 1b 4 1 164 255 Rule 2 14 3,565 6,789 Rule 3 29 3 441 36 45 Rule 4a 122 126 273 4 12,770 Rule 4b 563 0.2% 4,037 0.2% 3,188 0.1% 125K 0.1% Rule 4c 19 558 Rule 5 285 0.1% 23.4K 1.2% 69K 2.8% 15 1.1M 1.2% Total 282K 1.9M 2.4M 53.5M 94.3M MAJORITY HANDLED BY RULE 1a Count is number of stereocentres, values of zero and percentages close to zero removed to reduce complexity
  • 14. 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 1 2 3 4 5 6 7 8 9 10 Sphere %ofStereocentres Dataset chebi_154 chembl_23 eMolecules170601 pubchem pubchem_substance distance from root Majority (but not all) stereocentres labelled within first few spheres Best to generate digraph lazily as required Some digraphs are far too big to generate fully (e.g. fullerenes) 5 6 7 8 9 10 phere Dataset chebi_154 chembl_23 eMolecules170601 pubchem pubchem_substance
  • 16. Rule 1A I II Centres 2.0 R R JMol 14.20.3 R R ACD/ChemSketch 14.05beta R R Balloon 1.6.5beta R R KnowItAll ChemWindow 2018 R R ChemDraw 16.0 R R BIOVIA Draw 2017 R R MarvinSketch 17.17 R - Indigo 1.3.0Beta.r16 - R RDKit 2017.03.03 S R DataWarrior 4.6.0 R R CACTVS (NCI Resolver Aug 17) R R OPSIN 2.3.1 R R LexiChem (OEChem) 20170613 R R ChemDoodle 7.0.2 R R CDK 2.0 - R JUMBO 6 R - I II
  • 17. Rule 1B Centres 2.0 R JMol 14.20.3 R ACD/ChemSketch 14.05beta R Balloon 1.6.5beta R KnowItAll ChemWindow 2018 R ChemDraw 16.0 R BIOVIA Draw 2017 - MarvinSketch 17.17 - Indigo 1.3.0Beta.r16 - RDKit 2017.03.03 R DataWarrior 4.6.0 - CACTVS (NCI Resolver Aug 17) - OPSIN 2.3.1 R LexiChem (OEChem) 20170613 - ChemDoodle 7.0.2 - CDK 2.0 - JUMBO 6 -
  • 18. Rule 2 Jan 2015 Aug 2017 Centres R R JMol n/a R ACD/ChemSketch R R Balloon 1.6.5beta n/a R KnowItAll ChemWindow n/a R ChemDraw S S Accelrys/BIOVIA Draw S R MarvinSketch S S Indigo R R RDKit S S DataWarrior S S CACTVS S R OPSIN R R LexiChem (OEChem) S R ChemDoodle S n/a CDK S S JUMBO - - R or S? Let’s Vote https://nextmovesoftware.com/blog/2015/01/21/r-or-s-lets-vote/
  • 19. Rule 4b (labels shown on the structure drawing: S S S R). Centres 2.0: R; JMol 14.20.3: R; ACD/ChemSketch 14.05beta: R; Balloon 1.6.5beta: R; KnowItAll ChemWindow 2018: R; ChemDraw 16.0: R; BIOVIA Draw 2017: R; MarvinSketch 17.17: R; Indigo 1.3.0Beta.r16: R; RDKit 2017.03.03: S; DataWarrior 4.6.0: S; CACTVS (NCI Resolver Aug 17): S; OPSIN 2.3.1: -; LexiChem (OEChem) 20170613: -; ChemDoodle 7.0.2: s; CDK 2.0: -; JUMBO 6: -
  • 20. MANCUDE RINGS (structures I, II). Centres 2.0: R, R; JMol 14.20.3: R, R; ACD/ChemSketch 14.05beta: R, R; Balloon 1.6.5beta: R, R; KnowItAll ChemWindow 2018: R, R; ChemDraw 16.0: R, R; BIOVIA Draw 2017: R, R; MarvinSketch 17.17: R, R; Indigo 1.3.0Beta.r16: S, R; RDKit 2017.03.03: R, R; DataWarrior 4.6.0: R, R; CACTVS (NCI Resolver Aug 17): S, R; OPSIN 2.3.1: S, R; LexiChem (OEChem) 20170613: S, R; ChemDoodle 7.0.2: S, R; CDK 2.0: S, R; JUMBO 6: S, S
  • 21. AUX DESCRIPTORS. Centres 2.0: R; JMol 14.20.3: R; ACD/ChemSketch 14.05beta: R; Balloon 1.6.5beta: R; KnowItAll ChemWindow 2018: R; ChemDraw 16.0: R; BIOVIA Draw 2017: R; MarvinSketch 17.17: -; Indigo 1.3.0Beta.r16: -; RDKit 2017.03.03: -; DataWarrior 4.6.0: -; CACTVS (NCI Resolver Aug 17): -; OPSIN 2.3.1: -; LexiChem (OEChem) 20170613: -; ChemDoodle 7.0.2: -; CDK 2.0: -; JUMBO 6: -
  • 22. hard to implement A: MarvinSketch 17.17. Turning aromaticity on flips stereochemistry (e.g. CHEBI:16063). Labels depend on input order. [Structure drawings with atom numbering and R/S/r/s labels not reproduced]
  • 23. hard to implement B: ChemDraw 16.0. Becomes undefined at distance ≥ 16. [Structure drawings not reproduced: the centre is labelled (R) with (CH2)2 / CH2 chains and with (CH2)11 / (CH2)10 chains, and unlabelled with (CH2)17 / (CH2)16 chains]
  • 24. open cip? Why? • Provide a blessed implementation that can be used directly or compared against • Toolkit-agnostic library to facilitate downstream integration
  • 25. “FIX-CIP” COLLABORATION. Robert Hanson (JMol), John Mayfield (Centres), Mikko Vainio (Balloon), Andrey Yerin (ACD/Name), Sophia Gillian Musacchio (St. Olaf College). Goals: • Discuss and resolve software inconsistencies • Generate comprehensive test set based on BlueBook structure • Recommend rule amendments and additions. Publication in preparation
  • 26. should you use CIP? Yes: systematic nomenclature; human conversation (if no pen is handy). Probably not (better algorithms exist): unique labelling (see right); compute “conversation”; finding/cleaning stereocentres. No: relative comparison, e.g. substructure search.
  • 28. acknowledgements. SciMix Poster. Robert Hanson (JMol), Mikko Vainio (Balloon), Andrey Yerin (ACD/Name), Sophia Gillian Musacchio (St. Olaf College), Karl Nedwed (Bio-Rad), Noel O’Boyle (NextMove Software), Shuzhe Wang (NextMove Software)

Comparing Cahn-Ingold-Prelog Rule Implementations: the need for open-cip
John Mayfield, Daniel Lowe and Roger Sayle
NextMove Software Ltd, Cambridge, UK. NextMove Software Limited, Innovation Centre (Unit 23), Cambridge Science Park, Milton Road, Cambridge, UK, CB4 0EY. www.nextmovesoftware.com

Introduction
The Cahn-Ingold-Prelog (CIP) priority rules rank atoms around a stereogenic unit to assign a stereo-descriptor that is invariant to atom order and layout, for example R (right) or S (left) for tetrahedral atoms. A directed acyclic graph (digraph) is constructed for each stereogenic unit and the out edges from the root node are compared and ranked according to eight sequence rules[1]. Each rule is applied exhaustively and tested on the entire digraph before applying the next rule[2].

The application of Rule 1a to the digraph for 2-butanol ranks the out edges connected to the root, giving the label S (4 > 2 > 5 are anticlockwise looking towards 6).

Atoms connected by double and triple bonds as well as ring closures result in duplicated nodes in the digraph. In the example structure, atoms 5 and 6 appear twice and atom 1 (the root) appears three times. Due to this duplication, complex ring systems can generate exponentially large digraphs that are not computationally tractable. Further complexity in digraphs is caused by the use of fractional atomic numbers in mancude ring systems and the assignment of auxiliary descriptors for applying Rules 3-5.
[Figures: numbered structure drawings and their digraphs, not reproduced]

Eight sequence rules (in essence)
Rule 1a. Higher atomic number precedes lower
Rule 1b. An atom node duplicated closer to the root ranks higher than one duplicated further
Rule 2. Higher atomic mass number precedes lower
Rule 3. Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst)
Rule 4b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pair
Rule 4c. r precedes s
Rule 5. An atom or group with descriptor R has priority over its enantiomorph S

Stereochemistry in Databases
Dataset                        Records
eMolecules (June 2017)         14 million
PubChem Substance              234 million
PubChem Compound (Aug 2017)    93 million
ChEMBL 23                      1.7 million
ChEBI 154                      95 thousand
[Figure: Number of Stereogenic Units. % of dataset against count of stereogenic units per record, for chebi_154, chembl_23, pubchem, pubchem_substance and eMolecules170601]
The number of defined stereogenic units per molecule varies between databases.

Columns: ChEBI | ChEMBL | eMolecules | PubChem Compound(1) | PubChem Substance (rows with fewer than five values had empty cells in the original; their values are given in source order)
Rule 1a: 281K (99.6%) | 1.8M (98.6%) | 2.4M (97.0%) | 53.5M (100.0%) | 93.1M (98.7%)
Rule 1b: 4 | 1 | 164 | 255
Rule 2: 14 | 3,565 | 6,789
Rule 3: 29 | 3 | 441 | 36 | 45
Rule 4a: 122 | 126 | 273 | 4 | 12,770
Rule 4b: 563 (0.2%) | 4,037 (0.2%) | 3,188 (0.1%) | 125K (0.1%)
Rule 4c: 19 | 558
Rule 5: 285 (0.1%) | 23.4K (1.2%) | 69K (2.8%) | 15 | 1.1M (1.2%)
Total: 282K | 1.9M | 2.4M | 53.5M | 94.3M
(1) As part of PubChem Compound’s processing, non-constitutional stereochemistry is removed: for example, the nine stereoisomers of inositol are all represented by CID 892.
The majority of stereogenic units are constitutionally asymmetric and can be ranked using Rule 1a. However, in some datasets the number of stereogenic units requiring Rules 4b and 5 can be significant.

Results
A set of fourteen structures was collected to identify differences between software implementations. The structures were selected to cover all the sequence rules and their applications to special cases.
Software                      I     II    III   IV    V     VI    VII   VIII  IX    X     XIa   XIb   XII   XIII
Centres 2.0                   R     R     R     R     R     R     R     R     R     r     R     R     r     R
JMol 14.20.3                  R     R     R     R     R     R     R     R     R     r     R     R     r     R
ACD/ChemSketch 14.05beta      R     R     R     R     R     R     R     R     R     r     R     R     r     R
Balloon 1.6.5beta             R     R     R     R     R     R     R     R     R     r     R     R     r     R
KnowItAll ChemWindow 2018     R     R     R     R     R     R     R     R     R     r     R     R     r     R(5)
ChemDraw 16.0                 R     R     R     R     S     R     R     R     R     r     R     R     r     R
BIOVIA Draw 2017              R     R     R     -     R     R     R     R     R     -(1)  R     R     -(1)  R
MarvinSketch 17.17            R     -     -     -     S     R     -     R     -     r     R     R     r     -
Indigo 1.3.0Beta.r16          -(2)  R     -     -     R     -     R     R     R     r     S     R     -     -
RDKit 2017.03.03              S     R     S     R     S     R     R     S     R     R     R     R     -     -
DataWarrior 4.6.0             R     R     R     -     S     R     R     S     R     R     R(3)  R     -     -
CACTVS (NCI Resolver Aug 17)  R     R     S     -     S(4)  R     R     S     R     R     S     R     -     -
OPSIN 2.3.1                   R     R     R     R     R     -     -     -     -     -     S     R     -     -
LexiChem (OEChem) 20170613    R     R     -     -     R     -     -     -     -     -     S     R     -     -
ChemDoodle 7.0.2              R     R     -     -     S     -     -     s     -     r     S     R     -     -
CDK 2.0                       -     R     R(5)  -     S     -     -     -     -     -     S     R     -     -
JUMBO 6                       R     -     S     -     -     -     -     -     -     -     S     S     -     -
Column groups, left to right: Constitutional (Rule 1a, 1b, 2); Geometrical + Topographical (Rule 3, 4a, 4b, 4c, 5); Special (Mancude, Aux Descriptors).
(1) Pseudoasymmetric r/s labels are not displayed but must be calculated, given the answers for IX and XIII
(2) Runtime error occurs
(3) Impossible to test as different Kekulé forms are normalised
(4) R in CACTVS since Feb 2015; the NCI resolver is an old version
(5) Other descriptor is assigned differently

Fix CIP Collaboration
Since submitting this work for presentation, the developers of Centres, JMol, ACD/ChemSketch, and Balloon have begun a collaboration. We are in the process of submitting for publication an extended in-depth validation set and proposing sequence rule refinements and additions where they are required.

Conclusion
The CIP sequence rules provide a standard way for chemists to effectively describe the configurations of most stereogenic units. However, beyond simple cases the complexity of the rules necessitates the use of software as an aid to naming configurations. The results demonstrate that, even then, software implementations do not all agree on the configuration.
Through the results presented here and the ongoing effort of the Fix CIP collaboration, software should aim to converge upon consistent stereochemistry naming. An Open CIP software tool could provide “blessed” stereochemistry configuration names and provide a standard algorithm implementation for other vendors to integrate or adapt.

Acknowledgements
Robert Hanson, Andrey Yerin, Mikko Vainio, and Sophia Gillian Musacchio for initiating and participating in the “Fix CIP” collaboration and the many in-depth technical discussions that have led to improvements in the tools. Karl Nedwed for providing KnowItAll results. Philip Skinner for providing ChemDraw licenses. Noel O’Boyle for feedback and suggestions.

Bibliography
1. P-92.1.3, Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013
2. Paulina Mata. The CIP System Again: Respecting Hierarchies Is Always a Must. J. Chem. Inf. Comput. Sci., 1999, 39 (6)
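The poster's 2-butanol example (4 > 2 > 5 by Rule 1a) can be reproduced with a short Python sketch. It is a deliberate simplification and not the algorithm of any tool listed above: it only handles acyclic molecules with explicit hydrogens, and it compares each branch by its atomic numbers sphere by sphere rather than by the full hierarchical digraph that reference [2] calls for. Atoms 1 to 6 follow the poster's numbering as far as it can be read from the text; the hydrogen indices 7 to 15 and the function names are made up for this example.

```python
ATOMIC_NUM = {"H": 1, "C": 6, "O": 8}

# 2-butanol: C1-C2-C3(-O4)(-C5)(-H6), with atom 3 as the stereocentre.
# Hydrogens 7-15 are numbered arbitrarily for this sketch.
elements = {1: "C", 2: "C", 3: "C", 4: "O", 5: "C", 6: "H"}
elements.update({i: "H" for i in range(7, 16)})
bonds = [
    (1, 2), (2, 3), (3, 4), (3, 5), (3, 6),
    (1, 7), (1, 8), (1, 9), (2, 10), (2, 11),
    (4, 12), (5, 13), (5, 14), (5, 15),
]

# Build an adjacency list from the bond list.
adjacency = {atom: [] for atom in elements}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)


def branch_key(root, start):
    """Atomic numbers of one branch, one list per sphere, each sorted high to low.

    Only valid for acyclic molecules: there is no duplicate-node handling,
    so a ring would make this loop forever.
    """
    key = []
    layer = [(start, root)]  # (atom, parent) pairs in the current sphere
    while layer:
        key.append(sorted((ATOMIC_NUM[elements[a]] for a, _ in layer), reverse=True))
        layer = [(n, a) for a, parent in layer for n in adjacency[a] if n != parent]
    return key


def rank_ligands(root):
    """Order the neighbours of `root` from highest to lowest Rule 1a priority."""
    return sorted(adjacency[root], key=lambda a: branch_key(root, a), reverse=True)


print(rank_ligands(3))  # [4, 2, 5, 6]
```

Oxygen wins in the first sphere. Atoms 2 and 5 tie there as carbons and are separated in the second sphere, where atom 2 carries (C, H, H) and atom 5 carries (H, H, H). Rings, multiple bonds and Rules 1b to 5 are exactly where this shortcut stops working and the implementations compared above begin to disagree.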