My Open Access papers

Open Access Publications of
Noel O’Boyle

November 2, 2011

Contents

I Cheminformatics toolkits
1 Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit
2 Cinfony - combining Open Source cheminformatics toolkits behind a common interface
3 Open Babel: An open chemical toolbox

II Enzyme reaction mechanisms
4 MACiE: a database of enzyme reaction mechanisms
5 MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

III QSAR
6 PYCHEM: a multivariate analysis package for python
7 Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

IV The Rest
8 Userscripts for the life sciences
9 Confab - Systematic generation of diverse low-energy conformers
10 Review of “Data Analysis with Open Source Tools”
11 Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on

Part I

Cheminformatics toolkits


Chemistry Central Journal
Software Open Access
Pybel: a Python wrapper for the OpenBabel cheminformatics
toolkit
Noel M O'Boyle*1,2, Chris Morley3 and Geoffrey R Hutchison4

Address: 1Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2
1EW, UK, 2Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK, 3OpenBabel Development Team and 4Department
of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA
Email: Noel M O'Boyle* - baoilleach@gmail.com; Chris Morley - c.morley@gaseq.co.uk; Geoffrey R Hutchison - geoffh@pitt.edu
* Corresponding author

Received: 23 January 2008
Accepted: 9 March 2008
Published: 9 March 2008
Chemistry Central Journal 2008, 2:5 doi:10.1186/1752-153X-2-5
This article is available from: http://journal.chemistrycentral.com/content/2/1/5
© 2008 O'Boyle et al
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Scripting languages such as Python are ideally suited to common programming tasks
in cheminformatics such as data analysis and parsing information from files. However, for reasons
of efficiency, cheminformatics toolkits such as the OpenBabel toolkit are often implemented in
compiled languages such as C++. We describe Pybel, a Python module that provides access to the
OpenBabel toolkit.
Results: Pybel wraps the direct toolkit bindings to simplify common tasks such as reading and
writing molecular files and calculating fingerprints. Extensive use is made of Python iterators to
simplify loops such as that over all the molecules in a file. A Pybel Molecule can be easily
interconverted to an OpenBabel OBMol to access those methods or attributes not wrapped by
Pybel.
Conclusion: Pybel allows cheminformaticians to rapidly develop Python scripts that manipulate
chemical information. It is open source, available cross-platform, and offers the power of the
OpenBabel toolkit to Python programmers.

Background
Cheminformaticians often need to write once-off scripts to extract data from text files, prepare
data for analysis or carry out simple statistics. Scripting languages such as Perl, Python and
Ruby are ideally suited to these day-to-day tasks [1]. Such languages are, however, an order of
magnitude or more slower than compiled languages such as C++. Since cheminformaticians regularly
deal with molecular files containing thousands of molecules and many cheminformatics algorithms
are computationally expensive, cheminformatics toolkits are typically written in compiled
languages for performance.

OpenBabel is a C++ toolkit with extensive capabilities for reading and writing molecular file
formats (over 80 are supported) as well as for manipulating molecular data [2]. Many standard
chemistry algorithms are included, for example, determination of the smallest set of smallest
rings, bond order perception, addition of hydrogens, and assignment of Gasteiger charges. In
relation to cheminformatics, OpenBabel supports SMARTS searching [3], molecular fingerprints [4]
(both Daylight-type, and structural-key based), and includes group contribution descriptors for
LogP [5], polar surface area (PSA) [6] and molar refractivity (MR) [5].


Of the current popular scripting languages, Python [7] is the de-facto standard language for
scripting in cheminformatics. Several commercial cheminformatics toolkits have interfaces in
Python: OpenEye's closed-source successor to OpenBabel, OEChem [8], is a C++ toolkit with
interfaces in Python and Java; Rational Discovery's RDKit [9], which is now open source, is a
C++ cheminformatics toolkit with a Python interface; the Daylight toolkit [10] from Daylight
Chemical Information Systems, written in C, only has Java and C++ wrappers but PyDaylight [11],
available separately from Dalke Scientific, provides a Python interface to the toolkit; the
Cambios Molecular Toolkit [12] from Cambios Consulting is a commercial C++ toolkit with a Python
interface. There are also toolkits entirely implemented in Python: Frowns [13], an open source
cheminformatics toolkit by Brian Kelley, and PyBabel [14], an open source toolkit included in
the MGLTools package from the Molecular Graphics Labs at the Scripps Research Institute. Note
that the latter is not related to the OpenBabel project; rather its name derives from the fact
that its aim was to implement in Python some of the functionality of Babel v1.6 [15], a
command-line application for converting file formats which is a predecessor of OpenBabel.

Here we describe the implementation and application of Pybel, a Python module that provides
access to the OpenBabel C++ library from the Python programming language. Pybel builds on the
basic Python bindings to make it easier to carry out frequent tasks in cheminformatics. It also
aims to be as 'Pythonic' as possible; that is, to adhere to Python language conventions and
idioms, and where possible to make use of Python language features such as iterators. The result
is a module that takes advantage of Python's expressive syntax to allow cheminformaticians to
carry out tasks such as SMARTS matching, data field manipulation and calculation of molecular
fingerprints in just a few lines of code.

Implementation

SWIG bindings
Python bindings to the OpenBabel toolkit were created using SWIG [16]. SWIG (Simplified Wrapper
and Interface Generator) is a tool that automates the generation of bindings to libraries
written in C or C++. One of the advantages of SWIG compared to other automated wrapping methods
such as Boost.Python [17] or SIP [18] is that SWIG also supports the generation of bindings to
several other languages. For example, OpenBabel also uses SWIG to generate bindings for Perl,
Ruby and Java. An additional advantage is that SWIG will directly parse C or C++ header files
while Boost.Python and SIP require each C++ class to be exposed manually. The input to SWIG is
an interface file containing a list of OpenBabel header files for which to generate bindings.
Using the signatures in the header files, SWIG generates a C file which, when compiled and
linked with the Python development libraries and OpenBabel, creates a Python extension module,
openbabel. This can then be imported into a Python script like any other Python module using the
"import openbabel" statement.

For a small number of C++ objects and functions, it was necessary to add some convenience
functions to facilitate access from Python. Certain types of molecule files have additional data
present in addition to the connection table. OpenBabel stores these data in subclasses of
OBGenericData such as OBPairData (for the data fields in molecule files such as MOL files and
SDF files) and OBUnitCell (for the data fields in CIF files). To access the data it is necessary
to 'downcast' an instance of OBGenericData to the specific subclass. For this reason, two
convenience functions were added to the interface file, one to cast OBGenericData to OBPairData,
and one to cast to OBUnitCell. Another convenience function was added to convert a Python list
to a C array of doubles, as this type of input is required for a small number of OpenBabel
functions.

Iterators are an important feature of the OpenBabel C++ library. For example, OBAtomAtomIter
allows the user to easily iterate over the atoms attached to a particular atom, and
OBResidueIter is an iterator over the residues in a molecule. The OpenBabel iterators use the
dereference operator to access the data, the increment operator to iterate to the next element,
and the boolean operator to test whether any elements remain. Iterators are also a core feature
of the Python language. However, the iterators used by OpenBabel are not automatically converted
into Python iterators. To deal with this, Python iterator classes that wrap the dereference,
increment and boolean operators behind the scenes were added to the SWIG interface file, so that
Python statements such as "for attached_obatom in OBAtomAtomIter(obatom)" work without problem.

Pybel module
The SWIG bindings provide direct access from Python to the C++ objects and functions in the
OpenBabel API (application programming interface). The purpose of the Pybel module is to wrap
these bindings to present a more Pythonic interface to OpenBabel (Figure 1). This extra level of
abstraction is useful as Python programmers expect Python libraries to behave in certain ways
that a C++ library does not. For example, in Python, attributes of an object are often directly
accessed whereas in C++ it is typical to call Get/Set functions to access them. A C++ function
returning a particular object might require a pointer to an empty object as a parameter, whereas
the Python equivalent would not. Even something as simple
as differences in the conventions for the case of letters used in variable and method names is a
problem, as it makes it more likely for Python programmers to introduce bugs in their code.

Figure 1: The relationship between Python modules described in the text and the OpenBabel C++
library. Python modules are shown in green; the C++ library is shown in blue.

One of the key aims of Pybel was to reduce the amount of code necessary to carry out common
tasks. This is especially important for a scripting language where programming is often done
interactively at a command prompt. In addition, as for any programming language, repeated entry
of code for routine and common tasks (so-called 'boilerplate code') is a common cause of errors
in code. Reading and writing molecule files is one of the most common tasks for users of
OpenBabel but requires several lines of code if using the SWIG bindings. The following code
shows how to store each molecule in a multimolecule SDF file in a list called allmols:

    import openbabel
    allmols = []
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("sdf")
    obmol = openbabel.OBMol()
    notatend = obconversion.ReadFile(obmol, "inputfile.sdf")
    while notatend:
        allmols.append(obmol)
        obmol = openbabel.OBMol()
        notatend = obconversion.Read(obmol)

To replace this somewhat verbose code, Pybel provides a readfile method that takes a file format
and filename and returns molecules using the 'yield' keyword. This changes the method into a
'generator', a Python language feature where a method behaves like an iterator. Iterators are a
major feature of the Python language which are used for looping over collections of objects. In
Pybel, we have used iterators where possible to simplify access to the toolkit. As a result, the
equivalent to the preceding code is:

    import pybel
    allmols = [mol for mol in pybel.readfile("sdf", "inputfile.sdf")]

The benefits of iterator syntax are clear when dealing with multimolecule files. For single
molecule files, however, the user needs to remember to explicitly request the iterator to return
the first and only molecule using the next method:

    mol = pybel.readfile("mol", "inputfile.mol").next()

Pybel provides replacements for two of the main classes in the OpenBabel library, OBMol and
OBAtom. The following discussion describes the Pybel Molecule class which wraps an instance of
OBMol, but the same design principles apply to the Pybel Atom class. Table 1 summarises the
attributes and methods of the Molecule object. By wrapping the base class, Pybel can enhance the
Molecule

Table 1: Attributes and methods supported by the Pybel Molecule object

Attribute   Description*

OBMol       The underlying OBMol object
atoms       A list of Pybel Atoms
charge      The total charge (GetTotalCharge)
data        A MoleculeData object for access to data fields
dim         The dimensionality of the coordinates (GetDimension)
energy      The heat of formation (GetEnergy)
exactmass   The mass calculated using isotopic abundance (GetExactMass)
flags       The set of flags used internally by OpenBabel (GetFlags)
formula     The stoichiometric formula (GetFormula)
mod         The number of nested BeginModify() calls (Internal use) (GetMod)
molwt       The standard molar mass (GetMolWt)
spin        The total spin multiplicity (GetTotalSpinMultiplicity)
sssr        The smallest set of smallest rings (GetSSSR)
title       The title of the molecule (often the filename) (GetTitle)
unitcell    Unit cell data (if present)

Method      Description

write       Write the molecule to a file or return it as a string
calcfp      Return a molecular fingerprint as a Fingerprint object
calcdesc    Return the values of the group contribution descriptors
__iter__    Enable iteration over the Atoms in the Molecule

*Where a Molecule attribute is a direct replacement for a 'Get' method of the underlying OBMol, the name of the method is given in parentheses.
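
As a rough illustration of the pattern summarised in Table 1, the following sketch wraps an object that exposes Get methods and re-exposes them as read-only attributes that are evaluated on every access. It is not Pybel's source code: StubOBMol, MoleculeSketch and the stub_atoms field are hypothetical names introduced here so that the example runs without OpenBabel installed, and only three of the attributes from the table are shown.

```python
class StubOBMol:
    """Hypothetical stand-in for an OBMol; it only mimics a few Get methods."""

    def __init__(self, title, formula, molwt, atoms):
        self._title = title
        self._formula = formula
        self._molwt = molwt
        self.stub_atoms = list(atoms)  # assumption: atoms are plain strings here

    def GetTitle(self):
        return self._title

    def SetTitle(self, title):
        self._title = title

    def GetFormula(self):
        return self._formula

    def GetMolWt(self):
        return self._molwt


class MoleculeSketch:
    """Illustrative wrapper: each attribute forwards to a Get method."""

    def __init__(self, obmol):
        # The wrapped object stays reachable, like the OBMol attribute in Table 1.
        self.OBMol = obmol

    @property
    def title(self):
        return self.OBMol.GetTitle()

    @property
    def formula(self):
        return self.OBMol.GetFormula()

    @property
    def molwt(self):
        return self.OBMol.GetMolWt()

    def __iter__(self):
        # Enables "for atom in mol" by iterating over the stub's atom list.
        return iter(self.OBMol.stub_atoms)


stub = StubOBMol("ethane", "C2H6", 30.07, ["C", "C"])
mol = MoleculeSketch(stub)
print(mol.title, mol.formula, mol.molwt)

# Nothing is cached, so a change made on the wrapped object is seen immediately.
stub.SetTitle("renamed")
print(mol.title)
print([atom for atom in mol])
```

Using properties rather than values copied in the constructor is one simple way to keep the wrapper consistent with a wrapped object that may be modified elsewhere.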

object by providing (1) direct access to attributes rather than through the use of Get methods,
(2) additional attributes of the object, and (3) additional methods that act on the object.

(1) As mentioned earlier, it is typical in Python to access attribute values directly rather
than using Get/Set methods. With this in mind, the Molecule class adds attributes such as
energy, formula and molwt (among others) which give the values returned by calling GetEnergy(),
GetFormula() and GetMolWt(), respectively, on the underlying OBMol (see Table 1 for the full
list).

(2) One of the aims of Pybel is to simplify access to some of the most common attributes. With
this in mind, an atoms attribute has been added which returns a list of the atoms of the
molecule as Pybel Atoms. Access to the data fields associated with a molecule has been
simplified by creation of a MoleculeData object which is returned when the data attribute of a
Molecule is accessed. MoleculeData presents a dictionary interface to the data fields of the
molecule. Accessing and updating these fields is more convoluted if using the SWIG bindings.
Compare the following statements for accessing the "comment" field of the variable mol, an
OBMol:

    # Using the SWIG bindings
    value = openbabel.toPairData(mol.GetData("comment")).GetValue()

    # Using Pybel
    value = pybel.Molecule(mol).data["comment"]

It should be noted that all of these attributes are calculated on-the-fly rather than stored for
future access as the underlying OBMol may have been modified.

(3) Four additional methods have been added to the Pybel Molecule (Table 1). The first is a
write method which writes a representation of the Molecule to a file and takes care of error
handling. As with reading molecules from files (see above), this method simplifies the procedure
significantly compared to using the SWIG bindings directly. In addition, a calcfp method and a
calcdesc method have been added which calculate a binary fingerprint for the molecule, and some
descriptor values, respectively. In the OpenBabel library these are not methods of the OBMol,
but rather are loaded as plugins (by OBFingerprint.FindFingerprint and OBDescriptor.FindType,
respectively) to which an OBMol is passed as input. The __iter__ method is a special Python
method that enables iteration over an object; in the case of a Molecule, the defined iterator
loops over the Atoms of the Molecule. This feature enables constructions such as "for atom in
mol" where mol is a Pybel Molecule.

SMARTS is a query language developed by Daylight Chemical Information Systems for molecular
substructure
searching [3]. As implemented in the OpenBabel toolkit, finding matches of a particular
substructure in a particular molecule is a four step process that involves creating an instance
of OBSmartsPattern, initialising it with a SMARTS pattern, searching for a match, and finally
retrieving the result:

    obsmarts = openbabel.OBSmartsPattern()
    obsmarts.Init("[#6][#6]")
    obsmarts.Match(obmol)
    results = obsmarts.GetUMapList()

Since a SMARTS query can be thought of as a regular expression for molecules, in Pybel we decided
to wrap the SMARTS functionality in an analogous way to Python's regular expression module, re.
With these changes, the same process takes only two steps, an initialisation step and a search
step:

    smarts = pybel.Smarts("[#6][#6]")
    results = smarts.findall(pybelmol)

Pybel was not written to replace the SWIG bindings but rather to make it simpler to perform
common tasks. As a result, Pybel does not attempt to wrap every single method and class in the
OpenBabel library. Because of this, a user may often want to interconvert between an OBMol and a
Molecule, or an OBAtom and an Atom. This is quite a straightforward process. A Pybel Molecule
can be created by passing an OBMol to the Molecule constructor. In the following example an
OBMol is created using the SWIG bindings and then written to a file using Pybel:

    obmol = openbabel.OBMol()
    a = obmol.NewAtom()
    a.SetAtomicNum(6)
    a.SetVector(0.0, 1.0, 2.0) # Set coordinates
    b = obmol.NewAtom()
    obmol.AddBond(1, 2, 1) # Single bond from Atom 1 to Atom 2
    pybel.Molecule(obmol).write("mol", "outputfile.mol")

The OBMol wrapped by a Pybel Molecule can be accessed through the OBMol attribute. This makes it
easy to call a method not wrapped by Pybel, such as OBMol.NumRotors, which returns the number of
rotatable bonds in a molecule:

    mol = pybel.readfile("mol", "inputfile.mol").next()
    numrotors = mol.OBMol.NumRotors()

Documentation and Testing
To minimise programming errors, programs written in dynamically-typed languages such as Python
should be tested comprehensively. Pybel has 100% code coverage in terms of unit tests, as
measured by Ned Batchelder's coverage.py [19]. It also has several doctests, short snippets of
Python code included in documentation strings which serve as both examples of usage and as unit
tests.

The Pybel API is fully documented with docstrings. These can be accessed in the usual way with
the help() command at the interactive Python prompt after importing Pybel: for example,
"help(pybel.Molecule)". In addition, the OpenBabel Python web page [20] contains a complete
description of how to use the SWIG bindings and the Pybel API. The webpage also contains links
to HTML versions of the OpenBabel API documentation and Pybel API documentation. The latter is
included in Additional File 1.

Results and Discussion
The principal aim of Pybel is to make it simpler to use the OpenBabel toolkit to carry out
common tasks in cheminformatics. These common tasks include reading and writing molecule files,
accessing data fields of a molecule, computing and comparing molecular fingerprints and SMARTS
matching. Here we present some examples that illustrate how Pybel may be used to carry out
common cheminformatics tasks.

Removal of duplicate molecules
When merging different datasets or as a final step in preprocessing, it may be necessary to
identify and remove duplicate molecules. In the following example, only the unique molecules in
the multimolecule SDF file "inputfile.sdf" will be written to "uniquemols.sdf". Here we will
assume that a unique InChI string (IUPAC International Chemical Identifier) indicates a unique
molecule. A similar procedure could be performed using the OpenBabel canonical SMILES format, by
replacing "inchi" with "can" in the following:

    import pybel
    inchis = []
    output = pybel.Outputfile("sdf", "uniquemols.sdf")
    for mol in pybel.readfile("sdf", "inputfile.sdf"):
        inchi = mol.write("inchi")
        if inchi not in inchis:
            output.write(mol)
            inchis.append(inchi)
    output.close()

Selection of similar molecules
Another common task in cheminformatics is the selection of a set of molecules of similar
structure to a target molecule. Here we will assume that structural similarity is indicated by a
Tanimoto coefficient [21] of at least 0.7 with respect to Daylight-type (that is, based on
hashed paths through the molecular graph) fingerprints. Note that Pybel redefines the | operator
(bitwise OR) for Fingerprint objects as the Tanimoto coefficient:

    import pybel
    targetmol = pybel.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = pybel.Outputfile("sdf", "similarmols.sdf")
    for mol in pybel.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Applying a Rule of Fives filter
In an influential paper, Lipinski et al. [22] performed an analysis of drug compounds that
reached Phase II clinical trials and found that they tended to occupy a certain range of values
for molecular weight, LogP, and number of hydrogen bond donors and acceptors. Based on this,
they proposed a rule with four criteria to identify molecules that might have poor absorption or
permeation properties. This is the Lipinski Rule of Fives, so-called as the numbers involved are
all multiples of five. The following example shows how to filter a database to identify only
those molecules that pass all four of the Lipinski criteria. The values of the Lipinski
descriptors are also added to the output file as data fields. Note that whereas molecular weight
is directly available as an attribute of a Molecule, and LogP is available as one of the three
group contribution descriptors calculated by OpenBabel, we need to use SMARTS pattern matching
to identify the number of hydrogen bond donors and acceptors. The SMARTS patterns used here
correspond to the definitions of hydrogen bond donor and acceptor used by Lipinski:

    import pybel

    HBD = pybel.Smarts("[#7,#8;!H0]")
    HBA = pybel.Smarts("[#7,#8]")

    def lipinski(mol):
        """Return the values of the Lipinski descriptors."""
        desc = {'molwt': mol.molwt,
                'HBD': len(HBD.findall(mol)),
                'HBA': len(HBA.findall(mol)),
                'LogP': mol.calcdesc(['LogP'])['LogP']}
        return desc

    passes_all_rules = lambda desc: (desc['molwt'] <= 500 and
                                     desc['HBD'] <= 5 and desc['HBA'] <= 10 and
                                     desc['LogP'] <= 5)

    if __name__=="__main__":
        output = pybel.Outputfile("sdf", "passLipinski.sdf")
        for mol in pybel.readfile("sdf", "inputfile.sdf"):
            descriptors = lipinski(mol)
            if passes_all_rules(descriptors):
                mol.data.update(descriptors)
                output.write(mol)
        output.close()

Future work
The future development of Pybel is closely linked to any changes and improvements to OpenBabel.
With each new release of the OpenBabel API, the SWIG bindings will be updated to include any
additional functionality. However, additions to the Pybel API will only occur if they simplify
access to new features of the OpenBabel toolkit of general use to cheminformaticians. In
general, the Pybel API can be considered stable, and an effort will be made to ensure that
future changes will be backwards compatible.

Conclusion
Pybel provides a high-level Python interface to the widely-used OpenBabel C++ toolkit. This
combination of a high performance cheminformatics toolkit and an expressive scripting language
makes it easy for cheminformaticians to rapidly and efficiently write scripts to manipulate
molecular data.

Pybel is freely available from the OpenBabel web site [2] both as part of the OpenBabel source
distribution and for Windows as an executable installer. Compiled versions are also available as
packages in some Linux distributions (openbabel-python in Fedora, for example).

Availability and Requirements
Project name: Pybel
Project home page: http://openbabel.sf.net/wiki/Python
Operating system(s): Platform independent
Programming language: Python
Other requirements: OpenBabel
License: GNU GPL
Any restrictions to use by non-academics: None

Authors' contributions
GRH is the lead developer of OpenBabel and created the SWIG bindings. NMOB developed Pybel, and
extended the SWIG interface file. CM compiled the SWIG bindings on Windows and added convenience
functions to the OpenBabel API to facilitate access from scripting languages. All authors read
and approved the final manuscript.

Additional material
Additional file 1
Pybel API. The HTML documentation of the Pybel API (application programming interface).
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-5-S1.zip]

Acknowledgements
The idea for the Pybel module was inspired by Andrew Dalke's work on PyDaylight [11]. We thank
the anonymous reviewers for their helpful comments.

References
1. Ousterhout JK: Scripting: Higher Level Programming for the 21st Century. [http://home.pacbell.net/ouster/scripting.html]
2. OpenBabel v.2.1.1 [http://openbabel.sf.net]
3. SMARTS – A Language for Describing Molecular Patterns [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html]
4. Flower DR: On the properties of bit string-based measures of chemical similarity. J Chem Inf Comput Sci 1998, 38:379-386.
5. Wildman SA, Crippen GM: Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 1999, 39:868-873.
6. Ertl P, Rohde B, Selzer P: Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J Med Chem 2000, 43:3714-3717.
7. Python [http://www.python.org]
8. OEChem: OpenEye Scientific Software: Santa Fe, NM.
9. RDKit [http://www.rdkit.org]
10. Daylight Toolkit: Daylight Chemical Information Systems, Inc.: Aliso Viejo, CA.
11. PyDaylight: Dalke Scientific Software, LLC: Santa Fe, NM.
12. Cambios Molecular Toolkit: Cambios Computing, LLC: Palo Alto, CA.
13. Frowns [http://frowns.sf.net]
14. PyBabel in MGLTools [http://mgltools.scripps.edu]
15. Babel v.1.6 [http://smog.com/chem/babel/]
16. SWIG v.1.3.31 [http://www.swig.org]
17. Boost.Python [http://www.boost.org/libs/python/doc/]
18. SIP – A Tool for Generating Python Bindings for C and C++ Libraries [http://www.riverbankcomputing.co.uk/sip/]
19. coverage.py [http://nedbatchelder.com/code/modules/coverage.html]
20. OpenBabel Python [http://openbabel.sourceforge.net/wiki/Python]
21. Jaccard P: La distribution de la flore dans la zone alpine. Rev Gen Sci Pures Appl 1907, 18:961-967.
22. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Del Rev 1997, 23:3-25.


Cinfony – combining Open Source cheminformatics toolkits behind
a common interface
Noel M O'Boyle*1 and Geoffrey R Hutchison2

Address: 1Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK and 2Department of Chemistry, University of
Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA
Email: Noel M O'Boyle* - oboyle@ccdc.cam.ac.uk; Geoffrey R Hutchison - geoffh@pitt.edu

Received: 9 October 2008
Accepted: 3 December 2008
Published: 3 December 2008

Abstract
Background: Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit
share the same core functionality but support different sets of file formats and forcefields, and
calculate different fingerprints and descriptors. Despite their complementary features, using these
toolkits in the same program is difficult as they are implemented in different languages (C++ versus
Java), have different underlying chemical models and have different application programming
interfaces (APIs).
Results: We describe Cinfony, a Python module that presents a common interface to all three of
these toolkits, allowing the user to easily combine methods and results from any of the toolkits. In
general, the run time of the Cinfony modules is almost as fast as accessing the underlying toolkits
directly from C++ or Java, but Cinfony makes it much easier to carry out common tasks in
cheminformatics such as reading file formats and calculating descriptors.
Conclusion: By providing a simplified interface and improving interoperability, Cinfony makes it
easy to combine complementary features of OpenBabel, the CDK and the RDKit.

Background
Cheminformatics toolkits are essential to the day-to-day work of the practising
cheminformatician. They enable the user to deal with such tasks as handling different chemistry
file formats, substructure searching, calculation of molecular fingerprints, and structure
diagram generation. The main Open Source cheminformatics libraries under active development are
OpenBabel [1], the Chemistry Development Kit (CDK) [2], and the RDKit [3]. OpenBabel is a C++
toolkit with bindings in Perl, Python, Ruby and Java, the CDK is a Java toolkit, while the RDKit
is another C++ toolkit with Python bindings. While the CDK has its origins in academia, both
OpenBabel and the RDKit originated in companies (OpenEye and Rational Discovery, respectively)
and have subsequently been developed by the community under Open Source licenses.

In general, all of these toolkits share the same core functionality although the implementation
details and underlying chemical model may differ. However, as a result of their independent
development and history, each has functionality specific to itself and each toolkit supports
different sets of file formats and forcefields, and can calculate different molecular
fingerprints and molecular descriptors (Table 1). Despite the diversity of these toolkits and
the potential benefits in being able to access all of them at the same time, there has been
little work on interoperability between them. This has resulted in a balkanization of this field
such that users of one toolkit rarely use another toolkit.

One way to achieve interoperability of chemical toolkits is through the use of standard file
formats for exchange of



data. For example, the CML project has defined a standardised XML format for chemical data [4], with successive releases refining and extending the original standard. The OpenSMILES effort [5] has attempted to resolve ambiguities in the published SMILES definition [6] to create a standard. While these efforts deserve support, they face inevitable problems achieving consensus and they require changes to existing software to support the standard. The large number of chemical file formats supported by OpenBabel (currently over 80) illustrates both the potential of achieving a standard as well as the difficulties.

Table 1: Some features of toolkits which are not shared by all three toolkits.

CDK
A large number of descriptors (some overlap with RDKit)
Pharmacophore searching (like RDKit*)
Calculation of maximum common substructure
2D structure layout (like RDKit) and depiction
MACCS keys (also RDKit) and E-State fingerprints
Integration with the R statistical programming environment
Support for mass-spectrometry analysis (representations for cleavage reactions, structure generation from formulae)
Fragmentation schemes (ring fragments, Murcko)
3D structure generation using a template and heuristics (like OpenBabel)
3D similarity using ultrafast shape descriptors
Gasteiger π charge calculation

OpenBabel
Not just focused on cheminformatics
Supports a very large number of chemical file formats including quantum mechanics file formats, molecular mechanics trajectories, 2D sketchers
3D structure generation using a template method (like CDK)
Included in all major Linux distributions
Bindings available from several scripting languages apart from Python, as well as the Java and .NET platforms
Conformation generation and searching
InChI (also CDK) and InChIKey generation
Support for crystallographic space groups
Several forcefield implementations: UFF (also RDKit), MMFF94, MMFF94s, Ghemical
Ability to add custom data types to atoms, bonds, residues, molecules

RDKit
A large number of descriptors (some overlap with CDK)
Fragmentation using RECAP rules
2D coordinate generation (like CDK) and depiction
3D coordinate generation using geometry embedding
Calculation of Cahn-Ingold-Prelog stereochemistry codes (R/S)
Pharmacophore searching (like CDK)
Calculation of shape similarity (based on volume overlap)
Chemical reaction handling and transforms
Atom pairs and topological torsions fingerprints
Feature maps and feature-map vectors
Machine-learning algorithms

* Where the term "like" is used, it indicates that the implementation details differ.

An alternative is interoperability at the API (application programming interface) level. This has the advantage that it does not require any changes to existing software. However, there are at least three barriers to overcome: the need for a programming language that can access all the toolkits simultaneously, the difficulty of exchanging chemical models between different toolkits, and differences in the API for core cheminformatics tasks shared by the toolkits.

Here we describe Cinfony, a Python module that overcomes these barriers to provide interoperability at the API level. Cinfony allows access to OpenBabel, the CDK, and the RDKit through a common interface, and uses a simple yet robust method to pass chemical models between toolkits. Pybel, one of the components of Cinfony, has been described previously [7]. It provides access to OpenBabel from standard Python. In this work, we show that the API developed for Pybel may be considered a generic API for accessing any cheminformatics toolkit. We describe the design and implementation of the Cinfony API for OpenBabel, the RDKit and the CDK. Next, we show how Cinfony simplifies the process of accessing the toolkits and how it can be used in practice to combine the power of the three Open Source toolkits. Finally, we discuss performance and some results from comparisons of the toolkits.

Implementation
Common Application Programming Interface
Cinfony presents the same interface to three cheminformatics toolkits, OpenBabel, the CDK and the RDKit. These are available through three separate modules: obabel, cdk and rdkit. The API is designed to make it easy to carry out many of the common tasks in cheminformatics, and covers the core functionality shared by all of the toolkits. Table 2 gives an overview of the API. The complete API is available here (see Additional file 1).

Table 2: An overview of the Cinfony API.

Class name      Purpose
Molecule        Wraps a molecule instance of the underlying toolkit and provides access to methods that act on molecules
Atom            Wraps an atom instance of the underlying toolkit
MoleculeData    Provides dictionary-like access to the information contained in the tag fields in SDF and MOL2 files
Outputfile      Handles multimolecule output file formats
Smarts          Wraps the SMARTS functionality of the toolkit in an analogous way to the Python 're' module for regular expression matching
Fingerprint     Simplifies Tanimoto calculation of binary fingerprints

Function name   Purpose
readfile        Return an iterator over Molecules in a file
readstring      Return a Molecule

Variable name   Purpose
descs           A list of descriptor IDs
forcefields     A list of forcefield IDs
fps             A list of fingerprint IDs
informats       A list of input format IDs
outformats      A list of output format IDs

The main class containing chemical information is the Molecule class. Rather than create a new chemical model, the Molecule class is a light wrapper around the molecule object in the underlying library, for example, around OBMol in the case of OpenBabel. Attribute values such as the molecular weight are calculated dynamically by querying the underlying molecule. This ensures that if the underlying OBMol, for example, is altered, the attribute values returned will still be correct. The actual underlying object (an OpenBabel OBMol, a CDK Molecule, or an RDKit Mol) can be accessed directly at any point.

The Molecule class also contains several methods that act on molecules such as methods for calculating fingerprints, adding hydrogens, and calculating descriptor values. This makes it easy to access these methods, and also brings them to the attention of the user. In the underlying toolkit these methods may not be present as part of the molecule class, and in fact, they can be difficult to find in the toolkit's API. For example, the Cinfony method Molecule.addh() adds explicit hydrogens to the molecule. Although the OBMol of OpenBabel has a corresponding method, OBMol.AddHydrogens(), the RDKit uses a global method, AddHs(Mol), while the CDK requires the user to instantiate a HydrogenAdder object, which can then be used to add hydrogens.

The Molecule methods described in the original Pybel API [7] have been extended to handle hydrogen addition and removal, structure diagram generation, assignment of 3D geometry to 0D structures and geometry optimisation using forcefields. Both the CDK and the RDKit are capable of 2D coordinate generation and 2D depiction. However, since OpenBabel currently has neither of these capabilities, a fourth toolkit, OASA, is used by Pybel for this purpose. OASA is a lightweight cheminformatics toolkit implemented in Python [8].

A new development in the latest version of OpenBabel is 3D coordinate generation and geometry optimisation using one of a number of forcefields. Since these methods are also available in the RDKit, and are under development in the CDK, two additional methods have been added to the Cinfony Molecule: make3D(), for 3D coordinate generation, and localopt(), for geometry optimisation. Particularly in the case of OpenBabel, these new methods simplify the process of generating 3D coordinates. Compare a single call to make3D() in Cinfony with the following OpenBabel code:

    structuregenerator = openbabel.OBOp.FindType('Gen3D')
    structuregenerator.Do(mol)
    mol.AddHydrogens()



    ff = openbabel.OBForceField.FindType("MMFF94")
    ff.Setup(mol)
    ff.SteepestDescent(50)
    ff.GetCoordinates(mol)

The Cinfony API is identical for all of the toolkits. However, the values returned by particular API calls are not necessarily standardised across toolkits. This Cinfony design decision is in agreement with the Principle of Least Surprise [9]; when the user accesses the underlying toolkit directly, they will get the same result as found when using Cinfony. This design decision places the responsibility on the user to become familiar with differences in how the toolkits behave. For example, all of the toolkits allow the calculation of path-based fingerprints. These encode all paths in the molecular graph up to a path length of P into a binary vector of length V, but the default values for V and P are different for each toolkit: 1024 and 7 for OpenBabel, 1024 and 8 for the CDK, and 2048 and 7 for RDKit. Although it is possible to alter these parameters for the CDK and the RDKit and so standardise V and P to 1024 and 7 for all of the toolkits, it is reasonable to assume that the developers of each package have chosen sensible defaults. In addition, the implementation details of each of the fingerprinters would still be different; for example, the RDKit sets four bits when hashing each molecular path, the others set one; OpenBabel does not set any bits for the one-atom fragments, N, C and O.

Interoperability
The ability to transfer chemical models between toolkits is essential to the goal of interoperability. However, the internal representation of a molecule is specific to a particular toolkit. For example, as well as the connection table and coordinates (if present), it may include derived data relating to aromaticity, the number of implicit hydrogens on an atom, or stereochemical configuration. Fortunately, the problem of transfer and storage of chemical information has already been solved by the development of molecular file formats, of which over 80 are now supported by OpenBabel. Specifically, the MDL MOL file format [10] and the SMILES format [5,6] are shared by all three toolkits, and are used by Cinfony to exchange information on molecules with 2D or 3D coordinates (MOL file format), and no coordinates (SMILES format), respectively.

By using existing file formats rather than trying to interconvert the internal models themselves, Cinfony takes advantage of the existing input/output code of each toolkit which is well-tested and mature. In addition, the translation process is transparent to the user. However, the user should be aware of known limitations of particular readers or writers. For example, the SMILES parser in CDK 1.0.3 ignores atom-based stereochemistry and thus that information is lost if a 0D rdkit or obabel Molecule with atom-based stereochemistry is converted to a cdk Molecule.

Cinfony Molecules are interconverted using the Molecule() constructor. For example, if obabelmol is an obabel Molecule, then the corresponding rdkit Molecule can be constructed using rdkit.Molecule(obabelmol). This mechanism can also be used to interface Cinfony to other cheminformatics toolkits. The only requirements are that the object passed to the Molecule() constructor needs to have a _cinfony attribute set to True, and an _exchange attribute containing a tuple (0, SMILES string) or (1, MOL file) depending on whether the molecule is 0D or not.

Implementation
The Python scripting language has two main implementations. The most widely used implementation is the original reference implementation of Python in C, referred to as CPython when necessary to distinguish it from other implementations. The next most widely used implementation is Jython, an implementation of Python in Java. Although most users of Python do so through CPython, Jython scripts have the advantage of being able to access Java libraries natively. They can also be compiled into Java classes to be used from Java programs. Jython scripts are also useful in contexts where Java is required but it is more convenient to work in Python; for example, to implement a Java web servlet or a node in a Java workflow environment such as KNIME [11].

As discussed earlier, one of the barriers to interoperability is the requirement for a programming language that can simultaneously access more than one of the toolkits. From CPython it is possible to use Cinfony modules to connect to OpenBabel (pybel), the CDK (cdkjpype) and the RDKit (rdkit). From Jython, there are modules for OpenBabel (jybel) and the CDK (cdkjython). Convenience modules obabel and cdk are provided that automatically import the appropriate OpenBabel or CDK module depending on the Python implementation. The relationship between these Cinfony modules and the underlying cheminformatics libraries is summarised in Figure 1.

pybel and jybel
OpenBabel provides SWIG [12] bindings for both CPython and Java (among other languages). pybel is a wrapper around the CPython bindings, and has previously been described in detail [7]. jybel is an implementation of the Cinfony API that allows the user to access OpenBabel from Jython using the Java bindings. Despite the fact that



jybel is used from a Java implementation of Python, and accesses a C++ library through the Java Native Interface (JNI), the jybel code differs from pybel in very few respects. In Jython, it is not possible to iterate directly over the wrapped STL vectors used by OpenBabel as their Java SWIG bindings do not implement the Iterable interface. Also, the current Jython implementation is 2.2 and does not support generator expressions, which were introduced in Python 2.4. Although both C++ and Python have the concept of a global function or variable, this is not the case in Java. SWIG places such functions, and get/set methods for accessing the variables, in a special class named openbabel. Global constants are placed in another class called openbabelConstants. A convenience module, obabel, is provided which automatically imports the appropriate module depending on the Python implementation.

Figure 1: Relationship of Cinfony modules to Open Source toolkits. Python modules are accessible from CPython (green), Jython (pale blue), or both (striped green and pale blue). Java libraries are indicated by dark blue, while C++ libraries are yellow.

cdkjpype and cdkjython
Since Jython runs on top of the Java Virtual Machine (JVM), it can access Java libraries such as the CDK natively. To access Java libraries from CPython, the Python library JPype [13] is needed. This starts an instance of the JVM and uses the JNI to communicate back and forth. Overall, the differences between the two wrappers are minor. Jython and JPype differ in the syntax used to handle Java exceptions. Also, JPype returns unicode strings from the CDK and these need to be converted to regular strings (otherwise problems arise if they are passed to an OpenBabel method expecting a std::string). The appropriate CDK wrapper, cdkjpype or cdkjython, will be imported if the user imports the convenience module cdk.

rdkit
Support for Python scripting has been part of the design of the RDKit from the start. The Python bindings in RDKit were created using Boost.Python [14], a framework for interfacing Python and C++. The Cinfony module rdkit uses these bindings to implement its API. It is currently not possible to access RDKit from Jython. RDKit has only preliminary support for Java bindings; when these are complete, a corresponding module will be added to Cinfony.

Dependency handling
A fully-featured installation of Cinfony relies on a large number of open source libraries. In particular, the 2D depiction capabilities introduce dependencies on several graphics libraries which may be problematic to install on a particular platform (Cairo and its Python bindings, Python Imaging Library, AGG and the Python wrapper AggDraw). With this in mind, Cinfony treats all dependencies as optional and only raises an Exception if the user calls a method or imports a module that requires a missing dependency.

For example, the Python Imaging Library (PIL) is required for displaying a 2D depiction on the screen. If all of the components of cinfony are installed except for PIL, Cinfony works perfectly except that an Exception is raised if the Molecule.draw() method is called with show = True (the default). The image can however be written to a file without problems (show = False, filename = "image.png"). Similarly, if a user is only interested in using the CDK and the RDKit, it is not necessary to install OpenBabel.

Full installation instructions for Windows, MacOSX and Linux are available from the Cinfony website. It should be noted that for Windows users, there is no need to compile or search for missing libraries as the dependencies are included as binaries in the Cinfony distribution.

Results
Cinfony API
The original Pybel API was designed to make it easy to use OpenBabel to perform the most common tasks in cheminformatics and to do so using idiomatic Python. Subsequently, we realised that the resulting API could be considered a generic API for wrapping the core functionality of any cheminformatics toolkit. Cinfony implements an extended version of the original Pybel API for the CDK and the RDKit, as well as OpenBabel. While the original Pybel was restricted to CPython, Cinfony can also be used from Jython to access the CDK and OpenBabel.

Cinfony helps cheminformaticians avoid the steep learning curve associated with starting to use a new toolkit.
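As a rough illustration of how a common interface can sit on top of back ends that expose the same operation in different ways, the following sketch wraps two stand-in molecule classes. Everything in it is hypothetical: ToyMolA, ToyMolB, toy_add_hs, the wrapper classes and the native attribute are invented for this example and belong neither to Cinfony nor to any real toolkit. Only the method name addh() and the _cinfony/_exchange convention are taken from the description given earlier, and the sketch handles the 0D (SMILES) case only.

```python
"""Toy sketch of a common wrapper API over two differently shaped back ends.

All classes and functions below are hypothetical stand-ins written for this
example; none of them belong to Cinfony or to any real toolkit.
"""


class ToyMolA:
    """Stand-in back end whose molecule has its own add-hydrogens method."""

    def __init__(self, smiles):
        self.smiles = smiles
        self.explicit_h = False

    def AddHydrogens(self):
        self.explicit_h = True  # modifies the molecule in place


class ToyMolB:
    """Stand-in back end that relies on a module-level function instead."""

    def __init__(self, smiles, explicit_h=False):
        self.smiles = smiles
        self.explicit_h = explicit_h


def toy_add_hs(mol):
    """Module-level helper for ToyMolB; returns a new molecule."""
    return ToyMolB(mol.smiles, explicit_h=True)


class _BaseMolecule:
    """Shared wrapper logic: hold the native object and support exchange."""

    _cinfony = True      # marks the wrapper as convertible
    _native_cls = None   # set by each subclass

    def __init__(self, obj):
        if getattr(obj, "_cinfony", False):
            # Another wrapper was passed in: rebuild from its exchange tuple.
            dimension, data = obj._exchange
            if dimension != 0:
                raise NotImplementedError("this toy only handles 0D (SMILES)")
            obj = self._native_cls(data)
        self.native = obj  # the underlying object stays directly accessible

    @property
    def _exchange(self):
        # (0, SMILES string) for a molecule without coordinates.
        return (0, self.native.smiles)

    @property
    def has_explicit_h(self):
        # Computed on demand from the native object, never cached.
        return self.native.explicit_h


class MoleculeA(_BaseMolecule):
    _native_cls = ToyMolA

    def addh(self):
        self.native.AddHydrogens()


class MoleculeB(_BaseMolecule):
    _native_cls = ToyMolB

    def addh(self):
        self.native = toy_add_hs(self.native)


# The same calls work whichever wrapper is chosen.
results = {}
for wrapper in (MoleculeA, MoleculeB):
    mol = wrapper(wrapper._native_cls("CCO"))
    mol.addh()
    results[wrapper.__name__] = mol.has_explicit_h

# Converting between wrappers goes through the exchange tuple.
converted = MoleculeB(MoleculeA(ToyMolA("CC=O")))
```

The design choice worth noting is that the wrappers never copy state out of the native object, so a change made directly to the native molecule is reflected the next time an attribute is read.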



With Cinfony, all of the core functionality of the toolkits can be accessed with the same interface. For example, in Cinfony, a molecule can be created from a SMILES string with:

    mol = toolkit.readstring("smi", SMILESstring)

RDKit:

    mol = Chem.MolFromSmiles(SMILESstring)

OpenBabel:

    mol = openbabel.OBMol()
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("smi")
    obconversion.ReadString(mol, SMILESstring)

CDK:

    builder = cdk.DefaultChemObjectBuilder.getInstance()
    sp = cdk.smiles.SmilesParser(builder)
    mol = sp.parseSmiles(SMILESstring)

The RDKit was designed with Python scripting in mind, and of the three toolkits is the most concise. On the other hand, OpenBabel uses a characteristically C++ approach. An empty molecule is created, and is passed to an OBConversion instance as a container for the molecule read from the SMILES string. The SmilesParser in the CDK requires an instance of an object implementing the IChemObjectBuilder interface.

Another advantage of a common API is that a script written for one toolkit can easily be modified to use another. As an example, here is a script that selects molecules that are similar to a particular target molecule. This script is taken from the original Pybel paper [7], but uses the CDK instead of OpenBabel and will run equally well from Jython and CPython. The only differences compared to the original script are that "pybel" has been replaced with "cdk", and the import statement has been changed from "import pybel":

    from cinfony import cdk

    targetmol = cdk.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = cdk.Outputfile("sdf", "similarmols.sdf")
    for mol in cdk.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Alternatively, we could just have made a single change to the original script, by replacing the import statement from "import pybel" with "from cinfony import cdk as pybel".

Using Cinfony to combine toolkits
Another goal of Cinfony is to make it easy to combine toolkits in the same script. This allows the user to exploit the complementary capabilities of different toolkits (Table 1). For example, let's suppose the user wants to (1) convert a SMILES string to 3D coordinates with OpenBabel, then (2) create a 2D depiction of that molecule with the RDKit, next (3) calculate descriptors with the CDK, and finally (4) write out an SDF file containing the descriptor values and the 3D coordinates. The full Python script is only seven lines long:

    from cinfony import rdkit, cdk, obabel
    mol = obabel.readstring("smi", "CCC=O")
    mol.make3D()
    rdkit.Molecule(mol).draw(show = False, filename = "aldehyde.png")
    descs = cdk.Molecule(mol).calcdesc()
    mol.data.update(descs)
    mol.write("sdf", filename = "aldehyde.sdf")

For cheminformaticians interested in developing QSAR or QSPR models, Cinfony can be used to simultaneously calculate descriptors from the RDKit, the CDK and OpenBabel. For example, the following script reads a multiline input file, with each line consisting of a SMILES string followed by a property value. For each molecule, it calculates all of the OpenBabel, RDKit and CDK descriptors (except for CDK's CPSA) and writes out the results as a tab-separated file suitable for reading with the statistical package R [15]. Note that in this example script, if descriptors share the same name only one is retained. This is the case for the TPSA descriptor in OpenBabel, which is replaced by the RDKit's TPSA descriptor.

    import string
    from cinfony import obabel, cdk, rdkit

    # Read in SMILES strings and observed property values
    smiles, propvals = [], []
    for line in open("data.txt"):
        broken = line.rstrip().split()
        smiles.append(broken[0])
        propvals.append(float(broken[1]))
    mols = [obabel.readstring("smi", smile) for smile in smiles]

    # Calculate descriptor values using OpenBabel,
    # the CDK (apart from 'CPSA') and the RDKit
    cdkdescs = [x for x in cdk.descs if x != 'CPSA']
    descs = []
    for mol in mols:
        d = mol.calcdesc()
        d.update(cdk.Molecule(mol).calcdesc(cdkdescs))
        d.update(rdkit.Molecule(mol).calcdesc())
        descs.append(d)

    # Write a file suitable for 'read.table' in R
    outputfile = open("inputforR.txt", "w")
    descnames = sorted(descs[0].keys(), key = string.lower)
    print >> outputfile, "\t".join(["Property"] + descnames)
    for smile, propval, desc in zip(smiles, propvals, descs):
        descvals = [str(desc[descname]) for descname in descnames]
        print >> outputfile, "\t".join([smile, str(propval)] + descvals)
    outputfile.close()

Performance
Accessing cheminformatics libraries using Cinfony allows the user to rapidly develop scripts that manipulate chemical information. However, there is a small price to be paid. Firstly, there is the cost of moving objects across the interface between Python and the cheminformatics libraries. Secondly, the additional code required by Cinfony to implement a standard API may slow performance further.

To assess the performance penalty for accessing cheminformatics toolkits using Cinfony rather than directly in the native language, we looked at two simple test cases: (1) iterating over an SDF file containing 25419 molecules, (2) iterating and printing out the molecular weight of each of the molecules. The SDF file used was 3_p0.0.sdf, the first portion of the drug-like subset of the ZINC 7.00 dataset [16]. The Cinfony scripts, Java and C++ source code are available as Additional file 2. The results are shown in Table 3.

While accessing the CDK using Jython is almost as fast as a pure Java implementation, there is a considerable overhead associated with using JPype to access the CDK from CPython (89% slower for the second test case). This overhead is due to passing objects between the JVM and CPython. For OpenBabel, there is little performance cost associated with accessing OpenBabel from either implementation of Python, although the jybel scripts are somewhat slower than pybel scripts. A small portion of this speed difference can be attributed to a slower startup (about 1.6 seconds for jybel, compared to 0.8 seconds for pybel). Finally, from the RDKit results in Table 3, it is clear that using Boost.Python to wrap a C++ library is more efficient than using SWIG. The difference in run times between the C++ and Python implementations is negligible.

In practice, the performance of a particular Cinfony script will depend on the extent to which information is passed



Table 3: Performance of Cinfony modules compared to a native Java or C++ implementation.

                Iterate over SDF          Iterate and calculate molecular weight
                Time (s)   Normalised     Time (s)   Normalised
CDK
  Native Java   21.2       1.00           36.8       1.00
  cdkjython     23.1       1.09           41.6       1.13
  cdkjpype      33.0       1.57           69.5       1.89

OpenBabel
  Native C++    31.9       1.00           43.0       1.00
  pybel         34.1       1.07           45.1       1.05
  jybel         38.0       1.19           49.6       1.15

RDKit
  Native C++    99.7       1.00           100.7      1.00
  rdkit         99.9       1.00           101.0      1.00

The times reported are wallclock times from the best of three runs on a dual-core Intel Pentium 4 3.2 GHz machine with 1GB RAM.
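The Normalised columns appear to be each time divided by the time for the corresponding native implementation. As a minimal sketch, assuming that definition, the following recomputes the ratios for the second test case from the rounded times in Table 3. The dictionary layout and the helper name normalise are choices made for this example only.

```python
# Wall-clock times in seconds, copied from the "Iterate and calculate
# molecular weight" columns of Table 3.
times = {
    "CDK": {"Native Java": 36.8, "cdkjython": 41.6, "cdkjpype": 69.5},
    "OpenBabel": {"Native C++": 43.0, "pybel": 45.1, "jybel": 49.6},
    "RDKit": {"Native C++": 100.7, "rdkit": 101.0},
}


def normalise(group):
    """Divide every time in a group by that group's native time.

    Assumes the normalised value is time / native time, rounded to two
    decimal places, which is an interpretation of the table rather than
    a definition stated in the text.
    """
    native = next(t for name, t in group.items() if name.startswith("Native"))
    return {name: round(t / native, 2) for name, t in group.items()}


normalised = {toolkit: normalise(group) for toolkit, group in times.items()}

# Relative slowdown of cdkjpype versus native Java, as a percentage.
slowdown_percent = round((times["CDK"]["cdkjpype"] / times["CDK"]["Native Java"] - 1) * 100)

for toolkit, ratios in normalised.items():
    for name, ratio in ratios.items():
        print("%-10s %-12s %.2f" % (toolkit, name, ratio))
```

Under this assumption the cdkjpype ratio comes out at 1.89, which corresponds to the 89% slowdown quoted for the second test case.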

back and forth between Python and the underlying Java or C++ library. Where most of the time is spent on computation in the underlying library, the speed difference between a native implementation and one using Cinfony is expected to be small.

Comparison of toolkits
Cinfony makes it easy to compare the results obtained by different toolkits for the same operations. This can be useful in identifying bugs, applying a test suite, or finding the strengths and weaknesses of particular implementations. For example, where different toolkits calculate the same descriptors, if the calculated values are not highly correlated it may indicate a bug in one or the other. Earlier, we mentioned that a difference in the treatment of implicit hydrogens causes different toolkits to give different values for molecular weight unless hydrogens are explicitly added. Ensuring that a particular result is in agreement with that obtained by another toolkit can act as a sanity check in such instances to avoid errors.

When carrying out the same operation with several toolkits, it is often convenient to iterate over the toolkits in an outer loop:

    from cinfony import obabel, rdkit, cdk

    for toolkit in [obabel, rdkit, cdk]:
        print toolkit.readstring("smi", "CCC").molwt

As an example of how such comparisons can be used to identify bugs in toolkits, let us consider depiction. As a dataset, we randomly chose 100 molecules from PubChem [17], with subsequent filtering to remove multicomponent molecules. For each molecule, PubChem provides an SDF file containing coordinates for a 2D depiction, as well as the depiction itself as a PNG file. PubChem uses the CACTVS toolkit [18] to generate the 2D coordinates as well as the corresponding depiction. Using a script similar to the following, we used Cinfony to generate 2D depictions using OASA (the depiction library used by pybel), the CDK and a development version of RDKit that all use the same 2D coordinates taken from the SDF file:

    from cinfony import pybel, rdkit

    for toolkit in [rdkit, pybel]:
        name = toolkit.__name__
        for mol in toolkit.readfile("sdf", "dataset.sdf"):
            mol.draw(filename = "%s_%s.png" % (mol.title, name),
                     show = False,
                     usecoords = True)

When the resulting images were compared for the PubChem entry CID7250053, an error was found in the depiction of the stereochemistry of an isopropyl group (Figure 2). Since the error only occurred in certain cases, it had not been previously noticed and would have been difficult to identify without such a comparative study. Once reported, the problem was quickly solved and the subsequent RDKit release depicted the stereochemistry correctly. A comparison of depictions by commercial toolkits



and depictions generated by Cinfony is available here (see Additional file 3).

Figure 2: Comparison of depictions of PubChem CID7250053 using different toolkits. The depiction using the development version of RDKit showed incorrect stereochemistry for the isopropyl substituent of the thiazole ring.

Conclusion
Cinfony makes it easy to combine complementary features of the three main Open Source cheminformatics toolkits. By presenting a standard simplified API, the learning curve associated with starting to use a new toolkit is greatly reduced, thus encouraging users of one toolkit to investigate the potential of others.

Cinfony is freely available from the Cinfony website [19], both as Python source code and as a Windows distribution containing dependencies. Installation instructions are provided for MacOSX, Linux and Windows.

Availability and requirements
Project name: Cinfony
Project home page: http://cinfony.googlecode.com
Operating system(s): Platform independent
Programming language: Python, Jython
Other requirements: OpenBabel, CDK, RDKit, Java, OASA, JPype, Python Imaging Library
License: BSD

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
NMOB conceived and developed Cinfony. GRH is the lead developer of OpenBabel and created the Python and Java SWIG bindings. All authors read and approved the final manuscript.

Additional material

Additional file 1
Miniwebsite API. A mini-website of the Cinfony API documentation.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S1.zip]

Additional file 2
Timing Code. A zip file containing Python, Java and C++ code used for run time comparisons for two test cases.
Click here for file
153X-2-24-S2.zip]

Additional file 3
Miniwebsite Depictions. A mini-website showing a comparison of the depictions generated by several cheminformatics toolkits.
Click here for file
153X-2-24-S3.zip]

Acknowledgements
Cinfony would not be possible without the work of many Open Source projects. In particular, we thank several developers who responded quickly to bug reports or queries: Beda Kosata (OASA), Greg Landrum (RDKit), Tim Vandermeersch (OpenBabel), Steve Ménard (JPype). Thanks also to Gilbert Mueller and Chris Morley for feedback on installing Cinfony. NMOB thanks Google Code for providing free web hosting and development tools for Cinfony. We thank the anonymous reviewers for several useful suggestions.

References
1. OpenBabel v.2.2.0 [http://openbabel.org]
2. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics. Curr Pharm Des 2006, 12:2110-2120.
3. Landrum G: RDKit. [http://www.rdkit.org].
4. Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.



5. Apodaca R, O'Boyle N, Dalke A, Van Drie J, Ertl P, Hutchison G, James CA, Landrum G, Morley C, Willighagen E, De Winter H: OpenSMILES. [http://www.opensmiles.org].
6. Daylight Chemical Information Systems Manual [http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html]
7. O'Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
8. Kosata B: OASA. [http://bkchem.zirael.org/oasa_en.html].
9. Raymond ES: The Art of UNIX Programming. Reading, MA: Addison-Wesley; 2003. [http://www.catb.org/~esr/writings/taoup/index.html]
10. Symyx CTfile formats [http://www.mdli.com/downloads/public/ctfile/ctfile.jsp]
11. KNIME – Konstanz Information Miner [http://knime.org]
12. SWIG v.1.3.36 [http://www.swig.org]
13. Ménard S: JPype. [http://jpype.sf.net].
14. Boost.Python [http://www.boost.org/libs/python/doc/]
15. R development core team: R: A language and environment for statistical computing. [http://www.R-project.org].
16. Irwin JJ, Shoichet BK: ZINC – A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Model 2005, 45:177-182.
17. PubChem [http://pubchem.ncbi.nlm.nih.gov/]
18. CACTVS Chemoinformatics Toolkit: Xemistry GmbH: Lahntal, Germany.
19. O'Boyle NM: Cinfony. [http://cinfony.googlecode.com].


O’Boyle et al. Journal of Cheminformatics 2011, 3:33
http://www.jcheminf.com/content/3/1/33

SOFTWARE Open Access

Open Babel: An open chemical toolbox
Noel M O’Boyle1, Michael Banck2, Craig A James3, Chris Morley4, Tim Vandermeersch4 and Geoffrey R Hutchison5*

Abstract
Background: A frequent problem in computational modeling is the interconversion of chemical structures
between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and
de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing
problem due to the multitude of different application areas for chemistry data, differences in the data stored by
different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-
neutral formats.
Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many
languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a
wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics
algorithms, from partial charge assignment and aromaticity detection, to bond order perception and
canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and
outline a variety of uses both in terms of software products and scientific research, including applications far
beyond simple format interconversion.
Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it
provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and
substructure and similarity searching. For developers, it can be used as a programming library to handle chemical
data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely
available under an open-source license from http://openbabel.org.

* Correspondence: geoffh@pitt.edu
5 University of Pittsburgh, Department of Chemistry, 219 Parkman Avenue, Pittsburgh, PA 15217, USA
Full list of author information is available at the end of the article

© 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

J. Cheminf. 2011, 3, 33.

Introduction
The history of chemical informatics has included a huge variety of textual and computer representations of molecular data. Such representations focus on specific atomic or molecular information and may not attempt to store all possible chemical data. For example, line notations like Daylight SMILES [1] do not offer coordinate information, while crystallographic or quantum mechanical formats frequently do not store chemical bonding data. Hydrogen atoms are frequently omitted from X-ray crystallography due to the difficulty in establishing coordinates, and are often left out by file formats, which rely on the “implicit valence” of heavy atoms to indicate their presence. Other types of representations require specification of atom types on the basis of a specific valence bond model, inclusion of computed partial charges, indication of biomolecular residues, or multiple conformations.

While attempts have been made to provide a standard format for storing chemical data, including most notably the development of Chemical Markup Language (CML) [2-6], an XML dialect, such formats have not yet achieved widespread use. Consequently, a frequent problem in computational modeling is the interconversion of molecular structures between different formats, a process that involves extraction and interpretation of their chemical data and semantics.

We outline, for the first time, the development and use of the Open Babel project, a full-featured open chemical toolbox, designed to “speak” the many different representations of chemical data. It allows anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas. It provides both ready-to-use programs as well as a complete, extensible programmer’s toolkit for developing cheminformatics software. It can handle reading,
writing, and interconverting over 110 chemical file formats, supports filtering and searching molecule files using Daylight SMARTS pattern matching [7] and other methods, and provides extensible fingerprinting and molecular mechanics frameworks. We will discuss the frameworks for file format interconversion, fingerprinting, fast molecular searching, bond perception and atom typing, canonical numbering of molecular structures and fragments, molecular mechanics force fields, and the extensible interfaces provided by the software library to enable further chemistry software development.

Open Babel has its origin in a version of OELib released as open-source software by OpenEye Scientific under the GPL (GNU General Public License). In 2001, OpenEye decided to rewrite OELib in-house as the proprietary OEChem library, so the existing code from OELib was spun out into the new Open Babel project. Since 2001, Open Babel has been developed and substantially extended as an international collaborative project using an open-source development model [8]. It has over 160,000 downloads, over 400 citations [9], is used by over 40 software projects [10], and is freely available from the Open Babel website [11].

Features
File Format Support
With the release of Open Babel 2.3, Open Babel supports 111 chemical file formats in total. It can read 82 formats and write 85 formats. These encompass common formats used in cheminformatics (SMILES, InChI, MOL, MOL2), input and output files from a variety of computational chemistry packages (GAMESS, Gaussian, MOPAC), crystallographic file formats (CIF, ShelX), reaction formats (MDL RXN), file formats used by molecular dynamics and docking packages (AutoDock, Amber), formats used by 2D drawing packages (ChemDraw), 3D viewers (Chem3D, Molden) and chemical kinetics and thermodynamics (ChemKin, Thermo). Formats are implemented as “plugins” in Open Babel, which makes it easy for users to contribute new file formats (see Extensible Interface below). Depending on the format, other data is extracted by Open Babel in addition to the molecular structure; for example, vibrational frequencies are extracted from computational chemistry log files, unit cell information is extracted from CIF files, and property fields are read from SDF files.

A number of “utility” file formats are also defined; these are not strictly speaking a way of storing the molecular structure, but rather present certain functionality through the same interface as the regular file formats. For example, the report format is a write-only utility format [12] that presents a summary of the molecular structure of a molecule; the fingerprint format [13] and fastsearch format [14] are used for similarity and substructure searching (see below); the MolPrint2D and Multilevel Neighborhoods of Atoms formats calculate circular fingerprints defined by Bender et al. [15,16] and Filimonov et al. [17,18] respectively.

Each format can have multiple options to control either reading or writing a particular format. For example, the InChI format has 12 options including an option “K” to generate an InChIKey, “T <param>” to truncate the InChI depending on a supplied parameter and “w” to ignore certain InChI warnings. The available options are listed in the documentation, are shown in the Graphical User Interface (GUI) as checkboxes or textboxes, and can be listed at the command-line. In fact, all three are generated from the same source: a documentation string in the C++ code.

Fingerprints and Fast Searching
Databases are widely used to store chemical information, especially in the pharmaceutical industry. A key requirement of such a database is the ability to index chemical structures so that they can be quickly retrieved given a query substructure. Open Babel provides this functionality using a path-based fingerprint. This fingerprint, referred to as FP2 in Open Babel, identifies all linear and ring substructures in the molecule of lengths 1 to 7 (excluding the 1-atom substructures C and N) and maps them onto a bit-string of length 1024 using a hash function. If a query molecule is a substructure of a target molecule, then all of the bits set in the query molecule will also be set in the target molecule. The fingerprints for two molecules can also be used to calculate structural similarity using the Tanimoto coefficient, the number of bits in common divided by the number of bits in the union of the bits set.

Clearly, repeated searching of the same set of molecules will involve repeated use of the same set of fingerprints. To avoid the need to recalculate the fingerprints for a particular multi-molecule file (such as an SDF file), Open Babel provides a fastindex format that solely stores a fingerprint along with an index into the original file. This index leads to a rapid increase in the speed of searching for matches to a query; datasets with several million molecules are easily searched interactively. In this way, a multi-molecule file may be used as a lightweight alternative to a chemical database system.

Bond Perception and Atom Typing
As mentioned above, many chemical file formats offer representations of molecular data solely as lists of atoms. For example, most quantum chemical software packages and most crystallographic file formats do not offer definitions of bonding. A similar situation occurs in the case of the Protein Data Bank (PDB) format; while standardized [19] files contain connectivity
information, non-standard files exist that often do not provide full connectivity information. Consequently, Open Babel features methods to determine bond connectivity, bond order perception, aromaticity determination, and atom typing.

Bond connectivity is determined by the frequently used algorithm of detecting atoms closer than the sum of their covalent radii, with a slight tolerance (0.45 Å) to allow for longer than typical bonds. To handle disorder in crystallographic data (e.g., PDB or CIF files), atoms closer than 0.63 Å are not bonded. A further filtering pass is made to ensure standard bond valency is maintained: each element has a maximum number of bonds, and if this is exceeded the longest bonds to an atom are successively removed until the valence rule is fulfilled.

After bond connectivity is determined, if needed or requested by the user, bond order perception is performed on the basis of bond angles and geometries. The method is similar to that proposed by Roger Sayle [20] and uses the average bond angle around an un-typed atom to determine sp and sp2 hybridized centers. 5-membered and 6-membered rings are checked for planarity to estimate aromaticity. Finally, atoms marked as unsaturated are checked for an unsaturated neighbor to give a double or triple bond. After this initial atom typing, known functional groups are matched, followed by aromatic rings, followed by remaining unsatisfied bonds based on a set of heuristics for short bonds, atomic electronegativity, and ring membership.

Atom typing is performed by “lazy evaluation,” matching atoms against SMARTS patterns to determine hybridization, implicit valence, and external atom types. Atom type perception may be triggered by adding hydrogens (which requires determination of implicit and explicit valence), exporting to a file format that requires atom types, or as requested by the user. To minimize the amount of typing required, when importing from a format with atom types specified, a lookup table is used to translate between equivalent types.

An important part of atom typing is aromaticity detection and assignment of Kekulé bond orders (kekulization). In Open Babel, a central aromaticity model is used, largely matching the commonly used Daylight SMILES representation [1], but with added support for aromatic phosphorus and selenium. Potential aromatic atoms and bonds are flagged on the basis of membership in a ring system possibly containing 4n+2 π electrons. Aromaticity is established only if a well-defined valence bond Kekulé pattern can be determined. To do this, atoms are added to a ring system and checked against the 4n+2 π electron configuration, gradually increasing the size to establish the largest possible connected aromatic ring system. Once this ring system is determined, an exhaustive search is performed to assign single and double bonds to satisfy all valences in a Kekulé form. Since this process is exponential in complexity, the algorithm will terminate if more than 30 levels of recursion or 15 seconds are exceeded (which may occur in the case of large fused ring systems such as carbon nanotubes).

Canonical Representation of Molecules
In general, for any particular molecular structure and file format, there are a large number of possible ways the structure could be stored; for example, there are N! ways of ordering the atoms in an MOL file. While each of the orderings encodes exactly the same information, it can be useful to define a canonical numbering of the atoms of a molecule and use this to derive a canonical representation of a molecule for a particular file format. For a zero-dimensional file format without coordinates, such as SMILES, the canonical representation could be used to index a database, remove duplicates or search for matches.

Open Babel implements a sophisticated canonicalization algorithm that can handle molecules or molecular fragments. The atom symmetry classes are the initial graph invariants and encode topological and chemical properties. A cooperative labeling procedure is used to investigate the automorphic permutations to find the canonical code. Although the algorithm is similar to the original Morgan canonical code [21], various refinements are implemented to improve performance. Most notably, the algorithm implements heuristics from the popular nauty package [22,23]. Another aspect handled by the canonical code is stereochemistry, as different labelings can lead to different parities. This is further complicated by the possibility of symmetry-equivalent stereocenters and stereocenters whose configuration is interdependent. The full details will be the subject of a separate publication.

Coordinate Generation in 2D and 3D
Open Babel, version 2.3, has support for 2D coordinate generation (Figure 1) through the donation of code by Sergei Trepalin, based on the code used in the MCDL chemical structure editor [24-26]. The MCDL algorithm aims to lay out the molecular structure in 2D such that all bond lengths are equal and all bond angles are close to 120°. The layout algorithm includes a small database of around 150 templates to help lay out cages and large fragment cycles. To deal with the problem of overlapping fragments, the algorithm includes an exhaustive search procedure that rotates around acyclic bonds by 180°.

Coordinate generation in 3D was introduced in Open Babel version 2.2, and improved in version 2.3, to enable
conversion from 0D formats such as SMILES to 3D formats such as SDF (Figure 1). The 3D structure generator builds linear components from scratch following geometrical rules based on the hybridization of the atoms. Single-conformer ring templates are used for ring systems. The template matching algorithm iterates through the templates from largest to smallest searching for matches. If a match is found, the algorithm continues but will not match any ring atoms previously templated except in the case of a single overlap (the two ring systems of a spiro group) or an overlap involving exactly two adjacent atoms (two fused ring systems). After an initial structure is generated, the stereochemistry (cis/trans and tetrahedral) is corrected to match the input structure. Finally, the energy of the structure is minimized using the MMFF94 forcefield [27-31] and a low energy conformer found using a weighted rotor search.

Figure 1 Interconversion of 0D, 2D and 3D structures. The structures shown are of sertraline, a selective serotonin reuptake inhibitor (SSRI) used in the treatment of depression. A SMILES string for sertraline is shown at the top; this can be considered a 0D structure (only connectivity and stereochemical information). From this, Open Babel can generate a 2D structure (bottom left, depicted by Open Babel) or a 3D structure (bottom right, depicted by Avogadro), and all of these can be interconverted.

While the 3D structure builder produces reasonable conformations for molecules without rings or with ring systems for which a template exists, the results may be poor for molecules with more complex ring systems or organometallic species. Future work will be performed to compare the results of Open Babel with other programs with respect to both speed and the quality of the generated structures [32].

Stereochemistry
A recent focus of Open Babel development has been to ensure robust translation of stereochemical information between file formats. This is particularly important when dealing with 0D formats as these explicitly encode the perceived stereochemistry. Open Babel 2.3 includes classes to handle cis/trans double bond stereochemistry, tetrahedral stereochemistry and square-planar stereochemistry (this last is still under development), as well as perception routines for 2D and 3D geometries, and routines to query and alter the stereochemistry.

The detection of stereogenic units starts with an analysis of the graph symmetry of the molecule to identify the symmetry class of each atom. However, given that a complete symmetry analysis also needs to take stereochemistry into account, this means that the overall stereochemistry can only be found iteratively. At each iteration, the current atom symmetry classes are used to identify stereogenic units. For example, a tetrahedral center is identified as chiral if it has four neighbors with different symmetry classes (or three, in the case where a lone pair gives rise to the tetrahedral shape).

Forcefields
Molecular mechanics functions are provided for use with small molecules. Typical applications include energy evaluation or minimization, alone or as part of a larger workflow. The selection of implemented force fields allows most molecular structures to be used and parameters to be assigned automatically. The MMFF94(s) force field can be used for organic or drug-like molecules [27-31]. For molecules containing any element of the periodic table or complex geometry (i.e. not supported by MMFF94), the UFF force field can be used instead [33]. Recently, code implementing the GAFF force field [34,35] was also contributed and released as part of version 2.3. All of the forcefields allow the application of constraints on particular atom positions, or particular distances.

Several conformer searching methods have been implemented using the forcefields, all based on the “torsion-driving” approach. This approach involves setting torsion angles from a set of predefined allowed values for a particular rotatable bond. The most thorough search method implemented is a systematic search method, which iterates over all of the allowed torsion angles for each rotatable bond in the molecule and retains the conformer with the lowest energy. Since a systematic search may not be feasible for a molecule with multiple rotatable bonds, a number of stochastic search methods are also available: the random search method, which tries random settings for the torsion angles (from the predefined allowed values), and a weighted rotor search, a stochastic search method that converges on a low energy conformer by weighting particular torsion angles based on the relative energy of the generated conformer. With Open Babel 2.3, conformer search based on a genetic algorithm is also available, which allows the application of filters (e.g. a diversity filter) and different scoring functions. This latter method can be used to generate a library of diverse conformers, or, like the other methods, to seek a low energy conformer [36].
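
The torsion-driving searches described above can be sketched in a few lines of Python. This is a minimal illustration only, not Open Babel's implementation: the energy function `toy_energy`, the set of allowed torsion values and the weight-update rule are all assumptions made for the example, whereas Open Babel evaluates conformer energies with one of its force fields.

```python
import itertools
import math
import random

# Assumed set of allowed torsion values (degrees) shared by every rotatable bond.
ALLOWED_TORSIONS = (60.0, 180.0, 300.0)


def toy_energy(torsions):
    """Hypothetical stand-in for a force field: each bond is cheapest at 180 degrees."""
    return sum(0.5 * (1.0 + math.cos(math.radians(t))) for t in torsions)


def systematic_search(n_bonds, energy=toy_energy, allowed=ALLOWED_TORSIONS):
    """Try every combination of allowed torsions and keep the lowest-energy one."""
    best = min(itertools.product(allowed, repeat=n_bonds), key=energy)
    return best, energy(best)


def weighted_rotor_search(n_bonds, n_trials, rng, energy=toy_energy,
                          allowed=ALLOWED_TORSIONS):
    """Stochastic search that favours torsion values seen in low-energy conformers."""
    weights = [[1.0] * len(allowed) for _ in range(n_bonds)]
    best, best_energy = None, math.inf
    for _ in range(n_trials):
        # Sample one allowed value per bond according to the current weights.
        picks = [rng.choices(range(len(allowed)), weights=w)[0] for w in weights]
        candidate = tuple(allowed[i] for i in picks)
        e = energy(candidate)
        if e < best_energy:
            best, best_energy = candidate, e
        # Assumed update rule: the reward is 1 for the best conformer seen so
        # far and shrinks as the energy rises above that best value.
        reward = 1.0 / (1.0 + (e - best_energy))
        for w, i in zip(weights, picks):
            w[i] += reward
    return best, best_energy


if __name__ == "__main__":
    print(systematic_search(4))
    print(weighted_rotor_search(4, 200, random.Random(0)))
```

With three allowed values per bond, the systematic search evaluates 3^n combinations for n rotatable bonds, which is why the stochastic variants matter for flexible molecules; the weighted variant trades the guarantee of finding the global minimum for a fixed number of energy evaluations.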

Implementation
Technical Details
Open Babel is implemented in standards-compliant C++. This ensures support for a wide variety of C++ compilers (MSVC, GCC, Intel Compiler, MinGW, Clang), operating systems (Windows, Mac OS X, Linux, BSD, Windows/Cygwin) and platforms (32-bit, 64-bit). Since version 2.3, it is compiled using the CMake build system [37,38]. This is an open-source cross-platform build system with advanced features for dependency analysis. The build system has an associated unit test framework CTest, which allows nightly builds to be compiled and tested automatically with the results collated and displayed on a centralized dashboard [39].

Figure 2 Architecture of the Open Babel codebase.

To simplify installation Open Babel has as few external dependencies as possible. Where such dependencies exist, they are optional. For example, if the XML development libraries are not available, Open Babel will still compile successfully but none of the XML formats (such as Chemical Markup Language, CML) will be available. Similarly, if the Eigen matrix and linear algebra library is not found, any classes that require fast matrix manipulation (such as OBAlign, which performs least squares alignment) will not be compiled.

While the majority of the Open Babel library is written in C++, bindings have been developed for a range of other programming languages, including Java and the .NET platform, as well as the so-called “dynamic” scripting languages Perl, Python, and Ruby. These are automatically generated from the C++ header files using the SWIG tool. As described previously [40], in the case of Python an additional module is provided named Pybel that simplifies access to the C++ bindings. These interfaces facilitate development of web-enabled chemistry applications, as well as rapid development and prototyping.

Code Architecture
The Open Babel codebase has a modular design as shown in Figure 2. The goal of this design is threefold:

1. To separate the chemistry, the conversion process and the user interfaces reducing, as far as possible, the dependency of one upon another.
2. To put all of the code for each chemical format in one place (usually a single file) and make the addition of new formats simple.
3. To allow the format conversion of not just molecules, but also any other chemical objects, such as reactions.

The code base can be considered as consisting of the following modules (Figure 2):

• The Chemical Core, which contains OBMol etc. and has all of the chemical structure description and manipulation. This is the heart of the application and its API can be used as a chemical toolbox. It has no input/output capabilities.
• The Formats, which read and write to files of different types. These classes are derived from a common base class, OBFormat, which is in the Conversion Control module. They also make use of the chemical routines in the Chemical Core module. Each format file contains a global object of the format class. When the format is loaded the class constructor registers the presence of the class with OBConversion. This means that the formats are plugins: new formats can be added without changing any framework code.
• Common Formats include OBMoleculeFormat and XMLBaseFormat from which most other formats (like Format A and Format B in the diagram) are derived. Independent formats like Format C are also possible.
• The Conversion Control, which also keeps track of the available formats, the conversion options and the input and output streams. It can be compiled without reference to any other parts of the program. In particular, it knows nothing of the Chemical Core: mol.h is not included.
• The User Interface, which may be a command line application, a Graphical User Interface (GUI), or may be part of another program that uses Open Babel’s input and output facilities. This depends only
on the Conversion Control module (obconversion.h is included), but not on the Chemical Core or on any of the Formats.
• The Fingerprint API, as well as being usable in external programs, is employed by the fastsearch and fingerprint formats.
• The Fingerprints, which are bit arrays that describe an object and which facilitate fast searching. They are also built as plugins, registering themselves with their base class OBFingerprint which is in the Fingerprint API.
• Other features such as Forcefields, Partial Charge Models and Chemical Descriptors, although not shown in the diagram, are handled similarly to Fingerprints.
• The Error Handling can be used throughout the program to log and display errors and warnings.

Extensible Interface
The utility of software libraries such as Open Babel depends on the ability of the design to be extended over time to support new functionality. To facilitate this, Open Babel implements a plugin interface for file formats, fingerprints, charge models, descriptors, “operators” and molecular mechanics force fields. This ensures a clean separation of the implementation of a particular plugin from the core Open Babel library code, and makes it easy for a new plugin (e.g. a new file format) to be contributed; all that is needed is a single C++ file and a trivial change to one of the build files. The operator plugins provide a very general mechanism for operating on a molecule (e.g. energy minimization or 3D coordinate generation) or on a list of molecules (e.g. filtering or sorting) after reading but before writing.

Plugins are dynamically loaded at runtime. This decreases the overall disk and memory footprint of Open Babel, allowing external developers to choose particular functionality needed for their application and ignore other, less relevant features. It also allows the possibility of a third party distributing plugins separately from the Open Babel distribution to provide additional functionality.

Open-Source License and Open Development
Open Babel is open-source software, which offers end users and third-party developers a range of additional rights not granted by proprietary chemistry software. Open-source software, at its most basic level, grants users the rights to study how their software works, to adapt it for any purpose or otherwise modify it, and to share the software and their modifications with others. In this sense, Open Source functions in similar ways to the processes of open peer review, publication, and citation in science. The rights granted by open source licenses largely coincide with the norms of scientific ethics to enable verifiability, repeatability, and building on previous results and theories.

Beyond these rights, Open Babel (like most other open-source projects) offers open development; that is, all development occurs in public forums and with public code repositories. This results in greater input from the community as any user can easily submit bug reports or feature suggestions, get involved in discussions on the future direction of Open Babel or even become a developer him/herself. In practice, the number of active contributors has increased over time through this level of open, public development (Figure 3). Moreover, it means that the development of the code is completely transparent and the quality of the software is available for public scrutiny. Indeed, since its inception, over 658 bugs have been submitted to the public tracker and fixed [41].

Figure 3 Number of contributors over time. Note that this graph only includes developers who directly committed code to the Open Babel source code repository, and does not include patches provided by users.

Validation and Testing
Open Babel includes an extensive test suite comprising 60 different test programs each with tens to hundreds of tests. In early 2010, a nightly build infrastructure and dashboard was put in place with support from Kitware, Inc. This has greatly improved code quality by catching regressions, and also ensures that the code compiles cleanly on all platforms and compilers supported by Open Babel. Some examples of tests that are run each night are:

(1) The MMFF94 forcefield code is tested against the MMFF94 validation suite.
(2) The OBAlign class, which was developed using
Test-Driven Development (TDD) methodology, is
run against its test suite.
(3) Handling of symmetry is validated by converting
several test cases between SMILES, 2D and 3D SDF,
and InChI (there are also several test programs with
unit tests for the individual stereo classes in the
API).
(4) The SMARTS parser is tested using over 250 Figure 4 The two failures found in the validation test for
reading/writing SMILES.
valid and invalid SMARTS patterns, and the
SMARTS matcher is tested using 125 basic
SMARTS patterns.
(5) The LSSR (Least Set of Smallest Rings) code is meso compound and so both SMILES strings are cor-
tested for invariance against changing the atom rect and represent the same molecule. However the
order for a series of polycyclic molecules. canonicalization algorithm should have chosen one
stereochemistry or the other for the canonical
Recently the development team has placed a major representation.
focus on increasing the robustness of file format transla- Another area of focus was the canonicalization algo-
tion particularly in relation to the commonly used rithm, which can be used to generate canonical SMILES
SMILES and MDL Molfile formats. Translating between as well as other formats. The algorithm can be tested by
these formats requires accurate stereochemistry percep- ensuring that the same canonical SMILES string is
tion, inference of implicit hydrogens, and kekulization of obtained even when the order of atoms in a molecule is
delocalized systems. While it is difficult to ensure that any complex piece of code is free of bugs, and Open Babel is no exception, validation procedures can be carried out to assess the current level of performance and to find additional test cases that expose bugs. The following procedure was used to guide the rewriting of stereochemistry code in Open Babel, a project that began in early 2009. Starting with a dataset of 18,084 3D structures from PubChem3D as an SDF file, we compared the result of (a) conversion to SMILES, followed by conversion of that to Canonical SMILES to (b) conversion directly to Canonical SMILES. This procedure can be used to flush out errors in reading the original SDF file, reading/writing SMILES (either due to stereochemistry errors or kekulization problems), and is also a test (to some extent) of the canonicalization code. At the time of starting this work (March 2009), the error rate found was 1424 (8%); by Oct 2009, combined work on stereochemistry, kekulization and canonicalization had reduced this to 190 (~1%), and continued improvements have reduced the number of errors down to two (shown in Figure 4) for Open Babel 2.3.1 (~0.01%). The first failure is due to a kekulization error in a polycyclic aromatic molecule incorporating heteroatoms: (a) gave c1ccc2c(c1)c1[nH][nH]c3c4c1c(c2)ccc4cc1c3cccc1 while (b) gave c1ccc2c(c1)c1nnc3c4c1c(c2)ccc4cc1c3cccc1. This error led to confusion over whether or not the aromatic nitrogens have hydrogens attached (they do not). The second failure involves confusion over the canonical stereochemistry at a bridgehead carbon: (a) gave C1CN2[C@@H](C1)CCC2 while (b) gave C1CN2[C@H](C1)CCC2. This is actually a

changed (while retaining the same connection table). The test stresses all areas of the library, including aromaticity perception, kekulization, stereochemistry, and canonicalization. The development of the canonicalization code in Open Babel was guided by applying this test to the 5,151,179 molecules in the eMolecules catalogue (dated 2011-01-02) with 10 random shuffles of the atom order. At the time of the Open Babel 2.2.3 release, there were 24,404 failures of the canonicalization algorithm; this has now been reduced to only four (shown in Figure 5, < 0.001%). The Open Babel nightly test suite ensures that this test passes for a number of problematic molecules. Although the canonicalization algorithm is still not perfect, we believe that the current level of performance (99.99992% success on the eMolecules catalogue) is acceptable for general use and with time we intend to improve performance further.

Given that the error rate for canonicalization and handling of stereochemistry is now quite low, the next area of focus for the Open Babel development team is to improve the handling of implicit valence for “unusual atoms.” This is particularly important for organometallic species and inorganic complexes.

Using Open Babel

Applications
The Open Babel package is composed of a set of user applications as well as a programming library. The main command line application provided is obabel (a small upgrade on the earlier babel), which facilitates file format conversion, filtering (by SMARTS, title, descriptor value, or property field), 3D or 2D structure generation,

J. Cheminf. 2011, 3, 33.


Figure 5 The four failures found in the validation test for canonicalization.
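
The atom-order test described in the text (reorder the atoms at random, keep the connection table, and check that the canonical form does not change) can be pictured with a small, library-free sketch. Everything in it is an assumption made for illustration: molecules are reduced to element lists plus bond pairs, and `toy_canonical` is a brute-force stand-in, not Open Babel's canonicalization algorithm.

```python
import itertools
import random


def shuffle_atoms(elements, bonds, rng):
    """Renumber the atoms at random; the connection table itself is unchanged."""
    new_index = list(range(len(elements)))
    rng.shuffle(new_index)                      # new_index[old] -> new position
    shuffled = [None] * len(elements)
    for old, new in enumerate(new_index):
        shuffled[new] = elements[old]
    return shuffled, [(new_index[a], new_index[b]) for a, b in bonds]


def toy_canonical(elements, bonds):
    """Brute-force canonical form: the smallest labelling over all atom orders.

    Exponential in the number of atoms, so only usable on tiny molecules.
    It stands in for a real canonicalization algorithm in this sketch.
    """
    best = None
    for new_index in itertools.permutations(range(len(elements))):
        relabelled = [None] * len(elements)
        for old, new in enumerate(new_index):
            relabelled[new] = elements[old]
        edges = sorted(tuple(sorted((new_index[a], new_index[b]))) for a, b in bonds)
        candidate = (tuple(relabelled), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


def count_failures(molecules, canonicalize, shuffles=10, seed=0):
    """Count molecules whose canonical form changes under any random reordering."""
    rng = random.Random(seed)
    failures = 0
    for elements, bonds in molecules:
        reference = canonicalize(elements, bonds)
        for _ in range(shuffles):
            if canonicalize(*shuffle_atoms(elements, bonds, rng)) != reference:
                failures += 1
                break
    return failures


# Heavy-atom graphs only (no hydrogens, bond orders or stereo): ethanol, acetic acid.
molecules = [
    (["C", "C", "O"], [(0, 1), (1, 2)]),
    (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
]
print(count_failures(molecules, toy_canonical))  # prints 0: the form is order-independent
```

Passing a real toolkit's canonical SMILES writer as `canonicalize` would turn the same harness into the kind of regression test the text describes; the default of 10 shuffles mirrors the number quoted there.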

conversion of hydrogens from implicit to explicit (and vice versa), and removal of small fragments or of duplicate structures. A number of features are provided to handle multi-molecule file formats (such as SDF or MOL2) and to use or manipulate the information in property fields and molecule titles. Here is an example of using obabel to convert from SDF format to SMILES:

    obabel inputmols.sdf -O outputmols.smi

A more complicated use would be to extract all molecules in an SDF file whose titles start with “active”:

    obabel inputmols.sdf -aT -o copy -O outputmols.sdf --filter "title='active*'"

The copy format specified by “-o copy” is a utility format that copies the exact contents of the input file (for the filtered molecules) directly to the output, without perception or interpretation. The “-aT” indicates that only the title of the input SDF file should be read; full chemical perception is not required.

The Open Babel graphical user interface (GUI) provides the same functionality. Figure 6 is a screenshot of the GUI carrying out the same filtering operation described in the obabel example above. The left panel deals with setting up the input file, the right panel handles the output and the central panel is for setting conversion options. Depending on whether a particular option requires a parameter, the available options are displayed either as check boxes or as text entry boxes. These interface elements are generated dynamically directly from the text description and help text provided by each format plugin.

Programming Library
The Open Babel library allows users to write chemistry applications without worrying about the low-level details of handling chemical information, such as how to read or write a particular file format, or how to use SMARTS for substructure searching. Instead, the user can focus on the scientific problem at hand, or on creating an easier-to-use interface (e.g. a GUI) to some of Open Babel’s functionality. The Open Babel API (Application Programming Interface) is the set of classes, methods and variables provided by Open Babel to the user for use in programs. Documentation on the complete API (generated using Doxygen [42]) is available from the Open Babel website [43], or can be generated from the source code.

The functionality provided by the Open Babel library is relied upon by many users and by several other software projects, with the result that introducing changes to the API would cause existing software to break. For this reason, Open Babel strives to maintain API stability over long periods of time, so that existing software will continue to work despite the release of new Open Babel versions with additional features, file formats and bug fixes. Open Babel uses a version numbering system that indicates how the API has changed with every release:

• Bug fix releases (e.g. 2.0.0 versus 2.0.1) do not change the API at all
• Minor version releases (e.g. 2.0 versus 2.1) will add to the API, but will otherwise be backwards-compatible
• Major version releases (e.g. 2 versus 3) are not backwards-compatible, and have changes to the API (including removal of deprecated classes and functions)

Figure 7 shows an example C++ program that uses the two main classes OBConversion and OBMol to print out the molecular weight of all of the molecules in an SDF file. This could be used, for example, to investigate differences in the molecular weight distribution between two databases. The same program is shown in Figure 8 but implemented using the Python bindings.

Examples of Use
Open Babel has already been referenced over 400 times for various uses. The most common use of Open Babel is through the obabel command line application (or the corresponding graphical user interface) for the interconversion of chemical file formats. Such conversions may also involve the calculation or inference of additional molecular information or application of a filter. Some



Figure 6 Screenshot of the Open Babel GUI. In the screenshot, the Open Babel GUI is running on Bio-Linux 6.0, an Ubuntu derivative.
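
The title filter used in the obabel example and in the GUI screenshot can be pictured with a short pure-Python sketch. It is only an illustration of the idea (read each record's title, copy matching records verbatim), not how Open Babel implements its copy format or its filter option. The function name and the use of shell-style wildcards via `fnmatch` are assumptions of the sketch.

```python
from fnmatch import fnmatchcase


def filter_sdf_by_title(sdf_text, pattern):
    """Return the SDF records whose title line matches a shell-style pattern.

    Matching records are copied character for character; nothing in the
    molecule block is parsed or interpreted.
    """
    kept = []
    record = []
    for line in sdf_text.splitlines(keepends=True):
        record.append(line)
        if line.strip() == "$$$$":            # "$$$$" terminates each SDF record
            title = record[0].strip()         # the title is the record's first line
            if fnmatchcase(title, pattern):
                kept.append("".join(record))
            record = []
    return "".join(kept)


# Tiny hand-made input; real records would carry atom and bond blocks.
example = (
    "active_1\n\n\nM  END\n$$$$\n"
    "inactive_2\n\n\nM  END\n$$$$\n"
    "active_3\n\n\nM  END\n$$$$\n"
)
print(filter_sdf_by_title(example, "active*"))
```

Because only the first line of each record is inspected, the sketch reflects why reading titles alone is enough for this kind of filtering.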

published examples of these include the following:

• interconversion of chemical file formats or representations [44-47]
• addition of hydrogens [48-50]
• generation of 3D molecular structures [51-53]
• calculation of partial charges [54,55]
• generation of molecular fingerprints [56-59]
• removal of duplicate molecules from a dataset [60]
• calculation of MOL2 atom types [61]

An interesting example that shows how a particular chemical representation may be used to facilitate a scientific study is the crystallographic study of Fábián and Brock who used Open Babel to generate InChI strings for molecules in the Cambridge Structural Database [62]. Exploiting the fact that InChIs of enantiomers are identical except at the enantiomer sublayer (“/m0”

Figure 7 Example C++ program that uses the Open Babel library. The program prints out the molecular weight of each molecule in the SDF file “dataset.sdf”.

Figure 8 Example Python program that uses the Open Babel library. The program prints out the molecular weight of each molecule in the SDF file “dataset.sdf”.
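
Since the figure images are not reproduced in this text, the following is a minimal sketch of a program in the spirit of Figure 8, not a copy of it. It assumes the Pybel module (`pybel.readfile` and the `molwt` attribute of `pybel.Molecule`) rather than the OBConversion and OBMol classes named in the text. The helper names `print_molecular_weights` and `read_sdf` are hypothetical.

```python
def print_molecular_weights(molecules, out=print):
    """Write the molecular weight of each molecule and return how many were seen.

    `molecules` is any iterable of objects with a `molwt` attribute, which is
    what pybel.Molecule provides.
    """
    count = 0
    for mol in molecules:
        out(mol.molwt)
        count += 1
    return count


def read_sdf(path):
    """Lazily read molecules from an SDF file using Open Babel's Pybel module."""
    try:
        from openbabel import pybel   # module layout in Open Babel 3.x
    except ImportError:
        import pybel                  # layout in Open Babel 2.x, current when the paper appeared
    return pybel.readfile("sdf", path)


# Usage, assuming the Python bindings are installed and dataset.sdf exists:
#   print_molecular_weights(read_sdf("dataset.sdf"))
```

Keeping the file reading separate from the printing loop means the loop can be exercised with any objects that expose `molwt`, without Open Babel installed.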



Table 1 Software applications and libraries that use Open Babel
Name | Description | Reference | Web page
Avogadro | GUI for molecular modelling and computational chemistry | G. Hutchison, M. Hanwell | http://avogadro.openmolecules.net/
cclib | Parse computational chemistry output files | [72] | http://cclib.sf.net/
CCP1GUI | GUI for computational chemistry | Jens Thomas | http://www.cse.scitech.ac.uk/ccg/software/ccp1gui
ChemAzTech | Manage a chemical laboratory database | Rémy Dernat | http://chemaztech.sf.net/
ChemSpotlight | Chemistry file indexer for MacOSX | G. Hutchison | http://chemspotlight.openmolecules.net/
ChemT | GUI for generating combinatorial libraries | Rui Abreu | http://www.esa.ipb.pt/~ruiabreu/chemt
ChemTool | 2D molecular drawing | [73] | http://ruby.chemie.uni-freiburg.de/~martin/chemtool
CMDF | Library for handling and preparing multi-scale multi-paradigm simulations | [74] | http://web.mit.edu/mbuehler/www/research/CMDF/CMDF.htm
Confab | Systematically generate conformers | [36] | http://confab.googlecode.com/
DockoMatic | Automate the preparation and analysis of AutoDock runs | [75] | http://sf.net/projects/dockomatic/
DOVIS 2.0 | Automate the preparation and analysis of AutoDock runs | [76] | http://www.bhsai.org/dovis.html
FAF-Drugs2 | ADMET filtering of molecular datasets | [77] | http://www.mti.univ-paris-diderot.fr/fr/downloads.html
FMiner2 | Large-scale chemical graph mining based on backbone refinement classes | [78,79] | http://www.maunz.de/wordpress/bbrc
Ghemical | GUI for computational chemistry | Tommi Hassinen | http://www.uku.fi/~thassine/projects/ghemical
Gnome Chemistry Utils | 2D chemical editor, 3D viewer, chemical calculator and periodic table for Linux | Jean Bréfort | http://gchemutils.nongnu.org/
iBabel | MacOSX interface to Open Babel and other Open chemistry tools | Chris Swain | http://homepage.mac.com/swain/Sites/Macinchem/page65/ibabel3.html
Kalzium | GUI showing information on the periodic table of the elements | Carsten Niehaus | http://edu.kde.org/kalzium/
Lazar | Lazy Structure-Activity Relationships for toxicity prediction | [80] | http://www.in-silico.de/software/
Molekel | GUI for computational chemistry | Ugo Varetto | http://molekel.cscs.ch/
molsKetch | 2D chemical editor | Harm van Eersel | http://molsketch.sf.net/
MyChem | Chemistry extension to the MySQL database | J. Pansanel | http://mychem.sf.net/
NanoEngineer-1 | Computer-aided design for the nanoscale | Nanorex, Inc. | http://nanoengineer-1.net/
NanoHive-1 | Simulator for the study, experimentation, and development of nanotech entities | Brian Helfrich | http://www.nanohive-1.org/
OpenMD | Open Source molecular dynamics engine | [81] | http://openmd.net/
Open3DQSAR | High-throughput chemometric analysis of molecular interaction fields | [82,83] | http://www.open3dqsar.org/
OSRA | Extracts chemical structures from images | [84] | http://osra.sf.net/
PgChem | Chemistry extension to the PostgreSQL database | Ernst-Georg Schmidt | http://pgfoundry.org/projects/pgchem
Pharao | Pharmacophore discovery and searching | Silicos NV | http://www.silicos.be/
Pharmer | Pharmacophore searching | [85] | http://smoothdock.ccbb.pitt.edu/pharmer
Piramid | Shape-based alignment of molecules | Silicos NV | http://www.silicos.be/
PyADF | Library for handling and preparing quantum mechanical multi-scale simulations | [86] | http://www.ipc.kit.edu/cfn-ysg/158.php
PyRx | GUI for virtual screening with protein-ligand docking | Sargis Dallakyan | http://pyrx.scripps.edu/
QMForge | GUI for analysing results of quantum chemistry calculations | [72] | http://qmforge.sf.net/
RMG | Reaction Mechanism Generator | [87] | http://rmg.sf.net/
Sci3D | Interactive visualization of 3D models of scientific data, such as molecular structures and surfaces | T.J. O’Donnell | http://sci3d.sf.net/
Sieve | Filter molecules from datasets | Silicos NV | http://www.silicos.be/
SMIREP | Generation of fragment-based structure-activity relationships | [88] | http://www.karwath.org/systems/smirep.html
Stripper | Extract molecular scaffolds | Silicos NV | http://www.silicos.be/
Toxtree | Toxic hazard estimation using decision trees | Ideaconsult Ltd. | http://toxtree.sf.net/
V_Sim | Visualize atomic structures such as crystals and grain boundaries | Damien Caliste | http://inac.cea.fr/L_Sim/V_Sim/index.en.html
WebBabel | Web application for file format conversion | T.J. O’Donnell | http://webbabel.sf.net/
XDrawChem | 2D molecular editor | Bryan Herger | http://xdrawchem.sf.net/
XtalOpt | Extension to Avogadro for crystal-structure prediction | [89] | http://xtalopt.openmolecules.net/
YASARA | GUI for molecular graphics, modeling and simulation | Elmar Krieger | http://www.yasara.org/
ZODIAC | GUI for molecular modelling and docking | [90] | http://www.zeden.org/

or “/m1”), they used the InChIs as part of a workflow to identify kryptoracemates (a class of racemic crystals where the enantiomers are not related by space-group symmetry) in the database.

To implement new methods, or access additional molecular information, it is necessary to use the Open Babel library directly either from C++ or using one of the supported language bindings. Some examples of published studies that have done this include the following:

• Dehmer et al. implemented molecular complexity measures based on information theory [63].
• Langham and Jain developed a model for chemical mutagenicity based on atom pair features [64].
• Fontaine et al. implemented a method, anchor-GRIND, that uses an anchor point of a molecular scaffold to compare molecular interaction fields when different substituents are present [65].
• Konyk et al. have developed a plugin for Open Babel that adds support for the Web Ontology Language (OWL) to allow automated reasoning about chemical structures [66].
• Kogej et al. (AstraZeneca) implemented a 3-point pharmacophore fingerprint called TRUST [67].

Table 2 Web applications and databases that use Open Babel
Name | Description | Reference | Web page
ChemDB | Database of small molecules | [91] | http://cdb.ics.uci.edu/
Cheméo | Chemical structure and property search engine | Céondo Ltd | http://www.chemeo.com/
ChemMine Tools | Web application for analysing and clustering small molecules | [92] | http://chemmine.ucr.edu/
eMolecules | Chemical vendor search engine | eMolecules.com | http://emolecules.com/
FragmentStore | Database for comparison of fragments found in metabolites, drugs and toxic compounds | [93] | http://bioinf-applied.charite.de/fragment_store/
Frog2 | FRee Online druG 3D conformation generation | [94] | http://bioserv.rpbs.univ-paris-diderot.fr/cgi-bin/Frog2
hBar Lab | Web application providing on-demand access to computer-aided chemistry | hBar Solutions ApS | https://www.hbar-lab.com/
IUPHAR-DB | Database of human drug targets and their ligands | [95] | http://www.iuphar-db.org/
OpenCDLig | Web application for sharing resources about cyclodextrin/ligand complexes | [96] | https://kdd.di.unito.it/casmedchem/
PSMDB | Protein - Small-Molecule Database | [97] | http://compbio.cs.toronto.edu/psmdb/
SambVca | Web application for calculation of buried volume of organometallic ligands | [98] | https://www.molnac.unisa.it/OMtools/sambvca.php
ScafBank | Database of molecular scaffolds | [99] | http://202.127.30.184:8080/scafbank.html
SMARTCyp | Web application for prediction of sites of cytochrome P450 mediated metabolism | [100] | http://www.farma.ku.dk/smartcyp/
sMol Explorer | Web application for exploring small-molecule datasets | [101] | http://www3a.biotec.or.th/isl/index.php/smol-explorer
SuperImposé | Web application for structural similarity between ligands, binding sites or proteins | [102] | http://farnsworth.charite.de/superimpose-web/
SuperToxic | Database of toxic compounds | [103] | http://bioinformatics.charite.de/supertoxic/
SuperSite | Detailed information on, and comparisons of, protein-ligand binding sites | [104] | http://bioinf-tomcat.charite.de/supersite/
SuperSweet | Database of natural and artificial sweeteners | [105] | http://bioinf-applied.charite.de/sweet/
STITCH2 | Chemical-protein interactions | [106] | http://stitch.embl.de/
VCCLAB | Virtual Computational Chemistry Laboratory | [107] | http://www.vcclab.org/
wwLigCSRre | Web application that performs ligand-based screening using 3D similarity | [108] | http://bioserv.rpbs.univ-paris-diderot.fr/Help/wwLigCSRre.html



• Many other examples exist [68-71].

The vital role that a cheminformatics toolkit plays in the development of scientific resources is shown by Tables 1 and 2. Table 1 lists examples of stand-alone applications or programming libraries that rely on Open Babel, either calling the library directly or via one of the command-line executables. Table 2 contains examples of web applications and databases that either use Open Babel on the server or where Open Babel was used in the preparation of the data.

Conclusions
In November 2011, Open Babel will mark 10 years of existence as an independent project, and for the first time, we have discussed its development and features. As shown by more than 400 citations, it has become an essential tool for handling the myriad of molecular file formats encountered in diverse branches of chemistry. While more work remains to be done, through validation processes such as those described above and the recent introduction of a nightly build and testing framework, we aim to improve the quality and robustness of the toolkit with each new release.

Looking forward to the future, one of the goals of the project is to extend support to molecules that currently are not handled very well by existing cheminformatics toolkits. Typically toolkits focus on the types of molecules of principal importance to the pharmaceutical industry, namely stable organic molecules composed wholly of 2-center 2-electron covalent bonds. Molecules outside this set - such as radicals, organometallic and inorganic molecules, molecules with coordinate bonds or 3-center 2-electron bonds - are poorly supported in general. Future releases of Open Babel will provide substantially improved handling of such species. We also seek to improve speed and coverage of important methods such as structure generation, kekulization and canonicalization.

Open Babel is freely available from http://openbabel.org, and new community members are very welcome (users, developers, bug reporters, feature requesters). For information on how to use Open Babel, please see the documentation at http://openbabel.org/docs and the API documentation at http://openbabel.org/api.

Availability and Requirements
Project Name: Open Babel
Project home page: http://openbabel.org
Operating system(s): Cross-platform
Programming language: C++, bindings to Python, Perl, Ruby, Java, C#
Other requirements (if compiling): CMake 2.4+
License: GNU GPL v2
Any restrictions to use by non-academics: None

Acknowledgements and Funding
We would like to thank all users and contributors to the Open Babel project over its history, including OpenEye Scientific Software Inc. for their initial OELib code. We also thank the Blue Obelisk Movement for ideas, comments on this manuscript, and support. We thank SourceForge for providing resources for issue tracking and managing releases, and Kitware for additional dashboard resources. NMOB is supported by a Health Research Board Career Development Fellowship (PD/2009/13).

Author details
1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, Co. Cork, Ireland.
2 Department of Chemistry, Technische Universität München, Garching D-85747, Germany.
3 eMolecules, Inc., 420 Stevens Ave #120, Solana Beach, CA 92075, USA.
4 Open Babel development team.
5 University of Pittsburgh, Department of Chemistry, 219 Parkman Avenue, Pittsburgh, PA 15217, USA.

Authors’ contributions
GRH is the lead developer of the Open Babel project. CAJ, CM, MB, NMOB, and TV are developers of Open Babel. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Received: 27 June 2011 Accepted: 7 October 2011
Published: 7 October 2011

References
1. Weininger D: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988, 28:31-36.
2. Murray-Rust P, Rzepa H: Chemical markup, XML, and the Worldwide Web. 1. Basic principles. J Chem Inf Comput Sci 1999, 39:928-942.
3. Murray-Rust P, Rzepa HS: Chemical Markup, XML and the World-Wide Web. 2. Information Objects and the CMLDOM. J Chem Inf Model 2001, 41:1113-1123.
4. Murray-Rust P, Rzepa H, Wright M: Development of chemical markup language (CML) as a system for handling complex chemical content. New J Chem 2001, 25:618-634.
5. Murray-Rust P, Rzepa H: Chemical Markup, XML, and the World Wide Web. 4. CML Schema. J Chem Inf Comput Sci 2003, 43:757-772.
6. Holliday GL, Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. J Chem Inf Model 2006, 46:145-157.
7. Daylight Theory: SMARTS. [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html].
8. Fogel K: Producing Open Source Software: How to Run a Successful Free Software Project. O’Reilly Media, Inc. Sebastopol, CA; 2005.
9. Citations were generated by Google Scholar: [http://scholar.google.com/scholar?as_q=openbabel&num=10&as_occt=any&as_publication=&as_ylo=2001].
10. A selection of such projects is included below. The full list is available at: [http://openbabel.org/wiki/Related_Projects].
11. Open Babel: [http://openbabel.org/].
12. Open Babel Report Format: [http://openbabel.org/docs/2.3.0/FileFormats/Open_Babel_report_format.html].
13. Open Babel Fingerprint Format: [http://openbabel.org/docs/2.3.0/FileFormats/Fingerprint_format.html].
14. Open Babel Fastsearch Format: [http://openbabel.org/docs/2.3.0/FileFormats/Fastsearch_format.html].
15. MolPrint2D Format: [http://openbabel.org/docs/2.3.0/FileFormats/MolPrint2D_format.html].
16. Bender A, Mussa HY, Glen RC, Reiling S: Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naïve Bayesian Classifier. J Chem Inf Model 2004, 44:170-178.
17. MNA Format: [http://openbabel.org/docs/2.3.0/FileFormats/Multilevel_Neighborhoods_of_Atoms_(MNA).html].



18. Filimonov D, Poroikov V, Borodina Y, Gloriozova T: Chemical Similarity Assessment through Multilevel Neighborhoods of Atoms: Definition and Comparison with the Other Descriptors. J Chem Inf Model 1999, 39:666-670.
19. PDB Format v3.2: [http://www.wwpdb.org/documentation/format32/v3.2.html].
20. PDB: Cruft to Content: [http://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html].
21. Morgan HL: The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J Chem Docum 1965, 5:107-113.
22. Nauty: [http://cs.anu.edu.au/~bdm/nauty/].
23. McKay BD: Practical graph isomorphism. Congressus Numerantium 1981, 30:45-87.
24. Gakh A, Burnett M: Modular Chemical Descriptor Language (MCDL): Composition, connectivity, and supplementary modules. J Chem Inf Comput Sci 2001, 41:1494-1499.
25. Trepalin SV, Yarkov AV, Pletnev IV, Gakh AA: A Java Chemical Structure Editor Supporting the Modular Chemical Descriptor Language (MCDL). Molecules 2006, 11:219-231.
26. Gakh AA, Burnett MN, Trepalin SV, Yarkov AV: Modular Chemical Descriptor Language (MCDL): Stereochemical modules. J Cheminf 2011, 3:5.
27. Halgren T: Merck molecular force field .1. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem 1996, 17:490-519.
28. Halgren T: Merck molecular force field .2. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J Comput Chem 1996, 17:520-552.
29. Halgren T: Merck molecular force field .3. Molecular geometries and vibrational frequencies for MMFF94. J Comput Chem 1996, 17:553-586.
30. Halgren T, Nachbar R: Merck molecular force field .4. Conformational energies and geometries for MMFF94. J Comput Chem 1996, 17:587-615.
31. Halgren T: Merck molecular force field .5. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J Comput Chem 1996, 17:616-641.
32. Andronico A, Randall A, Benz RW, Baldi P: Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. J Chem Inf Model 2011, 51:760-776.
33. Rappe A, Casewit C, Colwell K, Goddard W III, Skiff WM: UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 1992, 114:10024-10035.
34. Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA: Development and testing of a general amber force field. J Comput Chem 2004, 25:1157-1174.
35. Wang J, Wang W, Kollman PA, Case DA: Automatic atom type and bond type perception in molecular mechanical calculations. J Molec Graph Model 2006, 25:247-260.
36. O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR: Confab - Systematic generation of diverse low-energy conformers. J Cheminf 2011, 3:8.
37. CMake: [http://www.cmake.org/].
38. Martin K, Hoffman B: Mastering CMake: A Cross-Platform Build System. Kitware, Inc., Clifton Park, NY; 5 2010.
39. CDash Dashboard for Open Babel: [http://my.cdash.org/index.php?project=Open+Babel].
40. O’Boyle N, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
41. Open Babel Bug Tracker: [https://sourceforge.net/tracker/?limit=25&func=&group_id=40728&atid=428740&status=2].
42. Doxygen: [http://www.doxygen.org/].
43. Open Babel API: [http://openbabel.org/api].
44. Myers J, Allison T, Bittner S, Didier B, Frenklach M, Green W, Ho Y, Hewson J, Koegler W, Lansing C, et al: A collaborative informatics infrastructure for multi-scale science. Cluster Computing 2005, 8:243-253.
45. Lind P, Alm M: A Database-Centric Virtual Chemistry System. J Chem Inf Model 2006, 46:1034-1039.
46. Amini A, Shrimpton PJ, Muggleton SH, Sternberg MJE: A general approach for developing system-specific functions to score protein-ligand docked complexes using support vector inductive logic programming. Proteins: Struct, Funct, Bioinf 2007, 69:823-831.
47. Arbor S, Marshall GR: A virtual library of constrained cyclic tetrapeptides that mimics all four side-chain orientations for over half the reverse turns in the protein data bank. J Comput-Aided Mol Des 2008, 23:87-95.
48. Huang Z, Wong CF: A Mining Minima Approach to Exploring the Docking Pathways of p-Nitrocatechol Sulfate to YopH. Biophys J 2007, 93:4141-4150.
49. Hill AD, Reilly PJ: A Gibbs free energy correlation for automated docking of carbohydrates. J Comput Chem 2008, 29:1131-1141.
50. Armen RS, Chen J, Brooks CL III: An Evaluation of Explicit Receptor Flexibility in Molecular Docking Using Molecular Dynamics and Torsion Angle Molecular Dynamics. J Chem Theory Comp 2009, 5:2909-2923.
51. Liu L, Ma H, Yang N, Tang Y, Guo J, Tao W, Jaa Duan: A Series of Natural Flavonoids as Thrombin Inhibitors: Structure-activity relationships. Thromb Res 2010, 126:e365-e378.
52. Wallach I, Jaitly N, Lilien R: A Structure-Based Approach for Mapping Adverse Drug Reactions to the Perturbation of Underlying Biological Pathways. PLoS One 2010, 5:e12063.
53. Paila YD, Tiwari S, Sengupta D, Chattopadhyay A: Molecular modeling of the human serotonin1A receptor: role of membrane cholesterol in ligand binding of the receptor. Molecular BioSystems 2011, 7:224-234.
54. Melville JL, Hirst JD: TMACC: Interpretable Correlation Descriptors for Quantitative Structure-Activity Relationships. J Chem Inf Model 2007, 47:626-634.
55. Pencheva T, Lagorce D, Pajeva I, Villoutreix BO, Miteva MA: AMMOS: Automated Molecular Mechanics Optimization tool for in silico Screening. BMC Bioinformatics 2008, 9:438.
56. Schietgat L, Ramon J, Bruynooghe M: An Efficiently Computable Graph-Based Metric for the Classification of Small Molecules. Proceedings of the 11th International Conference on Discovery Science. Springer-Verlag Berlin, Heidelberg; 2008, 197-209.
57. Krier M, Hutter MC: Bioisosteric Similarity of Molecules Based on Structural Alignment and Observed Chemical Replacements in Drugs. J Chem Inf Model 2009, 49:1280-1297.
58. Wang X, Huan J, Smalter A, Lushington GH: Application of kernel functions for accurate similarity search in large chemical databases. BMC Bioinformatics 2010, 11:S8.
59. Cheng T, Li Q, Wang Y, Bryant SH: Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection. J Chem Inf Model 2011, 51:229-236.
60. Mihaleva VV, Verhoeven HA, de Vos RCH, Hall RD, van Ham RCHJ: Automated procedure for candidate compound selection in GC-MS metabolomics based on prediction of Kovats retention index. Bioinformatics 2009, 25:787-794.
61. Bas DC, Rogers DM, Jensen JH: Very fast prediction and rationalization of pKa values for protein-ligand complexes. Proteins: Struct, Funct, Bioinf 2008, 73:765-783.
62. Fabian L, Brock CP: A list of organic kryptoracemates. Acta Cryst 2010, B66:94-103.
63. Dehmer M, Barbarini N, Varmuza K, Graber A: A Large Scale Analysis of Information-Theoretic Network Complexity Measures Using Chemical Structures. PLoS One 2009, 4:e8057.
64. Langham JJ, Jain AN: Accurate and Interpretable Computational Modeling of Chemical Mutagenicity. J Chem Inf Model 2008, 48:1833-1839.
65. Fontaine F, Pastor M, Zamora I: Anchor-GRIND: Filling the gap between standard 3D QSAR and the GRid-INdependent Descriptors. J Med Chem 2005, 48(7):2687-94.
66. Konyk M, De Leon A, Dumontier M: Chemical knowledge for the semantic web. Data Integration in the Life Sciences. Springer-Verlag Berlin, Heidelberg; 2008, 169-176.
67. Kogej T, Engkvist O, Blomberg N, Muresan S: Multifingerprint Based Similarity Searches for Targeted Class Compound Selection. J Chem Inf Model 2006, 46:1201-1213.
68. Reynès C, Host H, Camproux A-C, Laconde G, Leroux F, Mazars A, Deprez B, Fahraeus R, Villoutreix BO, Sperandio O: Designing Focused Chemical Libraries Enriched in Protein-Protein Interaction Inhibitors using Machine-Learning Methods. PLoS Computational Biology 2010, 6:e1000695.
69. Lagorce D, Pencheva T, Villoutreix BO, Miteva MA: DG-AMMOS: A New tool to generate 3D conformation of small molecules using Distance Geometry and Automated Molecular Mechanics Optimization for in silico Screening. BMC Chemical Biology 2009, 9:6.



70. Gómez MJ, Pazos F, Guijarro FJ, de Lorenzo V, Valencia A: The 94. Miteva MA, Guyon F, Tuffery P: Frog2: Efficient 3D conformation
environmental fate of organic pollutants through the global microbial ensemble generator for small compounds. Nucleic Acids Res 2010, 38:
metabolism. Molecular Systems Biology 2007, 3:114. W622-W627.
71. Kazius J, Nijssen S, Kok J, Bäck T, IJzerman AP: Substructure Mining Using 95. Sharman JL, Mpamhanga CP, Spedding M, Germain P, Staels B, Dacquet C,
Elaborate Chemical Representation. J Chem Inf Model 2006, 46:597-605. Laudet V, Harmar AJ, NC-IUPHAR: IUPHAR-DB: new receptors and tools for
72. O’Boyle NM, Tenderholt AL, Langner KM: cclib: A library for package- easy searching and visualization of pharmacological data. Nucleic Acids
independent computational chemistry algorithms. J Comput Chem 2008, Res 2010, 39:D534-D538.
29:839-845. 96. Esposito R, Ermondi G, Caron G: OpenCDLig: a free web application for
73. Brüstle M: Chemtool - Moleküle zeichnen mit dem Pinguin. Nachrichten sharing resources about cyclodextrin/ligand complexes. J Comput-Aided
aus der Chemie 2001, 49:1310-1313. Mol Des 2009, 23:669-675.
74. Buehler M, Dodson J, van Duin A: The Computational Materials Design 97. Wallach I, Lilien R: The protein-small-molecule database, a non-redundant
Facility (CMDF): A powerful framework for multi-paradigm multi-scale structural resource for the analysis of protein-ligand binding.
simulations. Materials Research Society symposium proceedings 2006, 894: Bioinformatics 2009, 25:615-620.
LL3.8. 98. Poater A, Cosenza B, Correa A, Giudice S, Ragone F, Scarano V, Cavallo L:
75. Bullock CW, Jacob RB, McDougal OM, Hampikian G, Andersen T: Samb Vca: A Web Application for the Calculation of the Buried Volume
Dockomatic - automated ligand creation and docking. BMC Research of N-Heterocyclic Carbene Ligands. Eur J Inorg Chem 2009,
Notes 2010, 3:289. 2009:1759-1766.
76. Jiang X, Kumar K, Hu X, Wallqvist A, Reifman J: DOVIS 2.0: an efficient and 99. Yan B-b, Xue M-z, Xiong B, Liu K, Hu D-y, Shen J-k: ScafBank: a public
easy to use parallel virtual screening tool based on AutoDock 4.0. Chem comprehensive Scaffold database to support molecular hopping. Acta
Cent J 2008, 2:18. Pharmacologica Sinica 2009, 30:251-258.
77. Lagorce D, Sperandio O, Galons H, Miteva MA, Villoutreix BO: FAF-Drugs2: 100. Rydberg P, Gloriam DE, Olsen L: The SMARTCyp cytochrome P450
Free ADME/tox filtering tool to assist drug discovery and chemical metabolism prediction server. Bioinformatics 2010, 26:2988-2989.
biology projects. BMC Bioinformatics 2008, 9:396. 101. Ingsriswang S, Pacharawongsakda E: sMOL Explorer: an open source, web-
78. Maunz A, Helma C, Kramer S: Efficient mining for structurally diverse enabled database and exploration tool for Small MOLecules datasets.
subgraph patterns in large molecular databases. Machine Learning 2010, Bioinformatics 2007, 23:2498-2500.
83:193-218. 102. Bauer RA, Bourne PE, Formella A, Frommel C, Gille C, Goede A, Guerler A,
79. Maunz A, Helma C, Kramer S: Large-scale graph mining using backbone Hoppe A, Knapp EW, Poschel T, et al: Superimpose: a 3D structural
refinement classes. Proceedings of the 15th ACM SIGKDD International superposition server. Nucleic Acids Res 2008, 36:W47-W54.
Conference on Knowledge Discovery and Data Mining (KDD 2009) ACM Paris; 103. Schmidt U, Struck S, Gruening B, Hossbach J, Jaeger IS, Parol R,
2009, 617-626. Lindequist U, Teuscher E, Preissner R: SuperToxic: a comprehensive
80. Helma C: Lazy structure-activity relationships (lazar) for the prediction of database of toxic compounds. Nucleic Acids Res 2009, 37:D295-D299.
rodent carcinogenicity and Salmonella mutagenicity. Mol Diversity 2006, 104. Bauer RA, Gunther S, Jansen D, Heeger C, Thaben PF, Preissner R: SuperSite:
10:147-158. dictionary of metabolite and drug binding sites in proteins. Nucleic Acids
81. Meineke MA, Vardeman CF, Lin T, Fennell CJ, Gezelter JD: OOPSE: an Res 2009, 37:D195-D200.
object-oriented parallel simulation engine for molecular dynamics. J 105. Ahmed J, Preissner S, Dunkel M, Worth CL, Eckert A, Preissner R:
Comput Chem 2005, 26:252-271. SuperSweet–a resource on natural and artificial sweetening agents.
82. Tosco P, Balle T: Brute-force pharmacophore assessment and scoring Nucleic Acids Res 2010, 39:D377-D382.
with Open3DQSAR. J Cheminf 2011, 3(Suppl 1):P39. 106. Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C,
83. Tosco P, Balle T: Open3DQSAR: a new open-source software aimed at Jensen LJ, Beyer A, Bork P: STITCH 2: an interaction network database for
high-throughput chemometric analysis of molecular interaction fields. J small molecules and proteins. Nucleic Acids Res 2009, 38:D552-D556.
Mol Model 2011, 17:201-208. 107. Tetko IV, Gasteiger J, Todeschini R, Mauri A, Livingstone D, Ertl P,
84. Filippov IV, Nicklaus MC: Optical Structure Recognition Software To Palyulin VA, Radchenko EV, Zefirov NS, Makarenko AS, et al: Virtual
Recover Chemical Information: OSRA, An Open Source Solution. J Chem Computational Chemistry Laboratory - Design and Description. J
Inf Model 2009, 49:740-743. Comput-Aided Mol Des 2005, 19:453-463.
85. Koes DR, Camacho CJ: Pharmer: Efficient and Exact Pharmacophore 108. Sperandio O, Petitjean M, Tuffery P: wwLigCSRre: a 3D ligand-based server
Search. J Chem Inf Model 2011, 51(6):1307-14. for hit identification and optimization. Nucleic Acids Res 2009, 37:
86. Jacob CR, Beyhan SM, Bulo RE, Gomes ASP, Götz AW, Kiewisch K, Sikkema J, W504-W509.
Visscher L: PyADF - A scripting framework for multiscale quantum
chemistry. J Comput Chem 2011, 32:2328-2338. doi:10.1186/1758-2946-3-33
87. Green HWilliam, Allen WJoshua, Ashcraft WRobert, Beran JGregory, Cite this article as: O’Boyle et al.: Open Babel: An open chemical
Class ACaleb, Gao Connie, Franklin Goldsmith C, Harper RMichael, toolbox. Journal of Cheminformatics 2011 3:33.
Jalan Amrit, Magoon RGregory, Matheu MDavid, Merchant SShamel,
Mo DJeffrey, Petway Sarah, Raman Sumathy, Sharma Sandeep, Song Jing,
Van Geem MKevin, Wen John, West HRichard, Wong Andrew, Wong Hsi-
Wu, Yelvington EPaul, Yu Joanna: RMG - Reaction Mechanism Generator
v3.3. 2011 [http://rmg.sourceforge.net/].
88. Karwath A, De Raedt L: SMIREP: Predicting Chemical Activity from SMILES.
J Chem Inf Model 2006, 46:2432-2444.
89. Lonie DC, Zurek E: XTALOPT: An open-source evolutionary algorithm for
crystal structure prediction. Comput Phys Commun 2011, 182:372-387. scientist can read your work free of charge
90. Zonta N, Grimstead IJ, Avis NJ, Brancale A: Accessible haptic technology
for drug design applications. J Mol Model 2008, 15:193-196.
91. Chen JH, Linstead E, Swamidass SJ, Wang D, Baldi P: ChemDB update full- colleagues in other parts of the globe, by allowing
text search and virtual chemical space. Bioinformatics 2007, 23:2348-2351. anyone to view the content free of charge.
92. Backman TWH, Cao Y, Girke T: ChemMine tools: an online service for W. Jeffery Hurst, The Hershey Company.
analyzing and clustering small molecules. Nucleic Acids Res 2011, 39(Web
Server issue):W486-91. available free of charge to the entire scientific community
93. Ahmed J, Worth CL, Thaben P, Matzig C, Blasse C, Dunkel M, Preissner R: peer reviewed and published immediately upon acceptance
FragmentStore–a comprehensive database of fragments linking cited in PubMed and archived on PubMed Central
metabolites, toxic molecules and drugs. Nucleic Acids Res 2010, 39:
D1049-D1054.

J. Cheminf. 2011, 3, 33.

Part II

Enzyme reaction mechanisms


Vol. 21 no. 23 2005, pages 4315–4316
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/bti693

Databases and ontologies

MACiE: a database of enzyme reaction mechanisms
Gemma L. Holliday1, Gail J. Bartlett2,†, Daniel E. Almonacid1, Noel M. O’Boyle1,
Peter Murray-Rust1, Janet M. Thornton2 and John B. O. Mitchell1,*
1Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge,
Lensfield Road, Cambridge, CB2 1EW, UK and 2EMBL-EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridge, CB10 1SD, UK
Received on July 21, 2005; revised on September 22, 2005; accepted on September 23, 2005
Advance Access publication September 27, 2005

ABSTRACT
Summary: MACiE (mechanism, annotation and classification in enzymes) is a publicly available web-based database, held in CMLReact (an XML application), that aims to help our understanding of the evolution of enzyme catalytic mechanisms and also to create a classification system which reflects the actual chemical mechanism (catalytic steps) of an enzyme reaction, not only the overall reaction.
Availability: http://www-mitchell.ch.cam.ac.uk/macie/
Contact: jbom1@cam.ac.uk

A great deal of knowledge about enzymes, including structures, gene sequences, mechanisms, metabolic pathways and kinetic data, now exists. However, it is spread between many different databases and throughout the literature. Here we announce the completion of the initial version of MACiE, a unique database of the chemical mechanisms of enzymatic reactions.
Web resources such as BRENDA (Schomburg et al., 2004), KEGG (Kanehisa et al., 2004) and the International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature website (IUBMB, 2005, http://www.chem.qmul.ac.uk/iubmb/enzyme/) contain descriptions of the overall reactions performed by enzymes, accompanied in some cases by a textual or graphical description of the mechanism. MACiE is unique in combining detailed stepwise mechanistic information (including 2D animations), a wide coverage of both chemical space and the protein structure universe, and the chemical intelligence of CMLReact (Holliday,G.L., Murray-Rust,P., and Rzepa,H.S., 2005, manuscript submitted to J. Chem. Inf. Modeling). MACiE usefully complements both the mechanistic detail of the Structure–Function Linkage Database (SFLD) for a small number of enzyme superfamilies (Pegg et al., 2005) and the wider coverage with less chemical detail provided by EzCatDB (Nagano, 2005), which also contains a limited number of 3D animations.

DESIGN
The MACiE dataset evolved from that published in the Catalytic Site Atlas (CSA) (Bartlett et al., 2002; Porter et al., 2004), and each entry is selected so that it fulfils the following criteria:

(1) There is a 3D crystal structure of the enzyme deposited in the Protein Data Bank (PDB) (Berman et al., 2000).
(2) There is a relatively well-understood mechanism available. Taken from the literature, these cover a variety of methodologies, including chemical and biochemical studies, quantum mechanical calculations and structural biology reports.
(3) The enzyme is unique at the H level of the CATH classification, a hierarchical classification system of protein domain structures (Orengo et al., 1997), unless there is a homologue with a significantly different chemical mechanism.
(4) Where there are a number of possible PDB codes available the entry should be, if possible, a wild-type enzyme.

All MACiE enzymes are also contained in the Enzyme Commission (EC) classification system (IUBMB, 2005, http://www.chem.qmul.ac.uk/iubmb/enzyme/), that is, they all have four number codes describing their overall reaction. The first level (Class) describes the basic reaction type. The second and third levels (subclass and sub-subclass, respectively) describe the reaction in further detail and the final level (serial number) describes substrate specificity. For example, the β-lactamases (Fig. 1) are assigned the EC number 3.5.2.6, i.e. a hydrolase (3) acting on a C–N bond (5) in a cyclic amide (2) with a β-lactam as the substrate (6).
In MACiE, the data centre on the catalytic steps involved in the chemical mechanism as well as the overall reaction. Each entry includes the following steps:
Enzyme name and EC number
PDB code and CATH codes of all domains in the enzyme
Diagram and annotation of the overall reaction
Primary literature references

*To whom correspondence should be addressed.
†Present Address: Bioinformatics Support Service (Biochemistry Building), Centre for Bioinformatics, Division of Molecular Biosciences, Faculty of Life Sciences, Imperial College London, London, SW7 2AZ, UK

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access
version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University
Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its
entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Bioinformatics. 2005, 21, 4315-4316.

Diagram and annotation of all reaction steps, including:
  - The Ingold mechanism (Ingold, 1969)
  - Diagram and function of catalytic amino acid residues
  - Information on the reactive centres and bond changes
Comments on the reaction (where applicable).

Fig. 1. The overall reaction for a β-lactamase.

CONTENT
The criteria defined in the Design section initially produced a dataset of 100 entries. A single EC number may cover a plurality of MACiE entries when different mechanisms bring about the same overall chemical transformation, as with the two types of 3-dehydroquinate dehydratase, and thus 100 MACiE entries span only 96 EC numbers.
The 100 enzymes in Version 1 of MACiE incorporate domains from 140 CATH homologous superfamilies. MACiE currently covers 56 of the 174 EC sub-subclasses present in the PDB; thus we feel that we have a representative coverage of EC reaction space (comparative EC wheels are available at URL http://www-mitchell.ch.cam.ac.uk/macie/ECCoverage/). We anticipate that all 158 sub-subclasses for which both structures and reliable mechanisms are available will be represented in the forthcoming MACiE Version 2.

SOFTWARE
The data are initially entered in MDL’s ISIS/Base, a database package for chemical reactions, validated by at least two people, and then converted into CMLReact using the Jumbo Toolkit (Wakelin et al., 2005) to create an information-rich, semantically rich database. At this stage we add extra fields of information to the CMLReact version of MACiE that are unavailable in the ISIS version, including the CATH code. Jumbo is a set of Java-based software tools which converts the MDL file format produced from ISIS/Base into CMLReact. The MacieConverter section of Jumbo performs the following functions:
Integration of the files in the ISIS/Base version of MACiE
Identification of reactant, product and spectator molecules
Splitting of groups of molecules
Automatic mapping of atoms within the reaction
Checking for mass and charge conservation throughout the reaction (stoichiometry)
Integration and checking of MACiE Dictionary entries.
Once the conversion process has been completed, a further tool in the Jumbo Toolkit, called CMLSnap (Holliday et al., 2004), can be used to create an animation of the reaction. This animation includes all of the atoms and bonds involved as well as the electron movements, which are calculated automatically. It is expected that CML will become our primary method of data entry and storage.

CURATION
The annotation process involves input and validation steps. Terms have been rigorously defined either from the IUPAC Gold Book (McNaught et al., 1997), such as chemical terms like hydrolysis, or from primary literature, such as mechanism, which is defined using Ingold’s terminology (Ingold, 1969), originally put forward in the 1930s. All of the technical and scientific terms used in MACiE are contained in the MACiE dictionary, which is available at the URL http://www-mitchell.ch.cam.ac.uk/macie/glossary.html and is also available as a raw XML file.
The entries online are accessed via an HTML look-up table and include all of the information available in the database. The original ISIS/Base format file and the raw CML files can be supplied.

FUTURE WORK
Future work includes expanding the dataset to include a representative set of EC numbers (at the sub-subclass level), creating a search interface for MACiE and developing authoring tools for MACiE in CML. Ongoing research focuses on the evolution of enzyme catalysis and the classification of enzyme reaction mechanisms.

ACKNOWLEDGEMENTS
G.J.B. would like to thank Dr Jonathan Goodman for his invaluable help with organic chemistry queries. We would also like to thank the EPSRC (G.L.H. and J.B.O.M.), the BBSRC (G.J.B. and J.M.T., CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M., grant BB/C51320X/1), the Chilean Government’s Ministerio de Planificación y Cooperación and Cambridge Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science Informatics.
Conflict of Interest: none declared.

REFERENCES
Bartlett,G.J. et al. (2002) Analysis of catalytic residues in enzyme active sites. J. Mol. Biol., 324, 105–121.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Holliday,G.L. et al. (2004) CMLSnap: animated reaction mechanisms. Internet J. Chem., 7, Article 4.
Ingold,C.K. (1969) Structure and Mechanism in Organic Chemistry. 2nd edn, Cornell University Press, Ithaca, NY, Chapters 5–15.
Kanehisa,M. et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
McNaught,A.D. and Wilkinson,A. (1997) International Union of Pure and Applied Chemistry Compendium of Chemical Terminology (‘‘The Gold Book’’). 2nd edn, ISBN 0-8-654-26848.
Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism Database. Nucleic Acids Res., 33, D407–D412.
Orengo,C.A. et al. (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Pegg,S.C-H. et al. (2005) Representing structure-function relationships in mechanistically diverse enzyme superfamilies. Pac. Symp. Biocomput., 358–369.
Porter,C.T. et al. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.
Schomburg,I. et al. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431–D433.
Wakelin,J. et al. (2005) CML tools and information flow in atomic scale simulations. Mol. Simul., 31, 315–322.
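The four-level EC code described in the Design section (class, subclass, sub-subclass, serial number) can be made concrete with a small worked example. The sketch below is illustrative only: the names `ECNumber`, `parse_ec` and `same_sub_subclass` are hypothetical helpers introduced here and are not part of MACiE or of any EC-related library.

```python
from typing import NamedTuple


class ECNumber(NamedTuple):
    """The four levels of an EC code, named as in the Design section."""
    ec_class: int       # basic reaction type
    subclass: int       # further detail of the reaction
    sub_subclass: int   # further detail of the reaction
    serial: int         # substrate specificity


def parse_ec(code: str) -> ECNumber:
    """Split a complete EC code such as '3.5.2.6' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdecimal() for p in parts):
        raise ValueError(f"not a complete four-level EC number: {code!r}")
    return ECNumber(*(int(p) for p in parts))


def same_sub_subclass(a: str, b: str) -> bool:
    """True when two EC codes agree on their first three levels."""
    return parse_ec(a)[:3] == parse_ec(b)[:3]


if __name__ == "__main__":
    # The beta-lactamase example from the text.
    print(parse_ec("3.5.2.6"))
```

Keeping the levels as separate named fields makes comparisons at a chosen depth, such as the sub-subclass level used when counting coverage, a matter of comparing tuple prefixes.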


Published online 1 November 2006 Nucleic Acids Research, 2007, Vol. 35, Database issue D515–D520
doi:10.1093/nar/gkl774

MACiE (Mechanism, Annotation and Classification
in Enzymes): novel tools for searching catalytic
mechanisms
Gemma L. Holliday*, Daniel E. Almonacid1, Gail J. Bartlett, Noel M. O’Boyle1,
James W. Torrance, Peter Murray-Rust1, John B. O. Mitchell1 and Janet M. Thornton

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 1Unilever Centre for
Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road,
Cambridge CB2 1EW, UK

Received August 4, 2006; Revised September 18, 2006; Accepted October 1, 2006

ABSTRACT
MACiE (Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and is publicly available as a web-based data resource. This paper presents the first release of a web-based search tool to explore enzyme reaction mechanisms in MACiE. We also present Version 2 of MACiE, which doubles the dataset available (from Version 1). MACiE can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/

INTRODUCTION
Enzymes are proteins that catalyse the repertoire of chemical reactions found in nature, and as such are vitally important molecules. What is so fascinating about these proteins is that they have a wonderful diversity and can carry out highly complex chemical conversions under physiological conditions and retain their stereospecificity and regiospecificity, unlike many organic chemical reactions. They range in size and can have molecular weights of several thousand to several million Daltons, and still they can catalyse reactions on molecules as small as carbon dioxide or nitrogen, or as large as a complete chromosome.
Although enzymes are large molecules, the actual catalysis only takes place in a small cavity, the active site. It is here that a small number of amino acid residues contribute to catalytic function, and where the substrates bind. With the advent of structure determination methods for proteins and by using clever chemical/biochemical experimental design, scientists have been able to propose catalytic mechanisms for many enzymes. Although a great deal of knowledge exists for enzymes, including their structures, gene sequences, mechanisms, metabolic pathways and kinetic data, it tends to be spread between many different databases and throughout the literature. Most web resources relating to enzymes [such as BRENDA (1), KEGG (2), the IUBMB Enzyme Nomenclature website (http://www.chem.qmul.ac.uk/iubmb/enzyme/) (3) and IntEnz (4)] focus on the overall reaction, accompanied in some cases by a textual or graphical description of the mechanism. However, this does not allow for detailed in silico searching of the chemical steps which take place in the reaction. MACiE (5) combines detailed stepwise mechanistic information [including 2-D animations (6)], a wide coverage of both chemical space and the protein structure universe, and the chemical intelligence of the Chemical Markup Language for Reactions (CMLReact) (7). This usefully complements both the mechanistic detail of the Structure–Function Linkage Database (SFLD) for a small number of rather ‘promiscuous’ enzyme superfamilies (8) and the wider coverage with less chemical detail provided by EzCatDB (9), which also contains a limited number of 3D animations. Entries in MACiE are linked, where appropriate, to all of these related data resources.

DATASET AND CONTENT
The dataset for MACiE version 2 was devised to increase the enzyme reaction space coverage of MACiE while trying to keep structural homology to a minimum. Each entry added in the new version was selected so that it fulfils the following criteria:

(i) The EC sub-subclass was not previously in MACiE.
(ii) There is a three-dimensional crystal structure of the enzyme deposited in the Protein Data Bank (wwPDB) (10).
(iii) There is a mechanism available from the primary literature which explains most of the observed experimental results.

*To whom correspondence should be addressed. Tel: +44 1223 492535; Fax: +44 1223 494486; Email: gemma@ebi.ac.uk
Present address:
Gail J. Bartlett, Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK

© 2006 The Author(s).
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nucleic Acid Res. 2007, 35, D515-D520.


Figure 1. EC wheels showing the EC coverage of MACiE Version 2 (left), the complete EC space (centre) and the coverage of EC space in the PDB by unique
EC serial numbers (right).

(iv) The enzyme is unique at the H level of the CATH code (11), unless the homologue already in MACiE has a significantly different chemical mechanism.

Using the above criteria MACiE was expanded from 100 entries in version 1 to a total of 202 entries, which span 199 EC numbers (version 1 spanned 96 EC numbers) and cover a total of 862 reaction steps. There are almost 4000 EC numbers defined, but the number of different reaction mechanisms needed to bring about all these overall transformations is not clear. For example, the serine protease family of proteins has many different substrates, but the mechanisms are broadly similar. In contrast the β-lactamase enzymes, which have the same EC number, have four completely different mechanisms. Within the EC code, the fourth digit usually defines the substrate specificity, which can be very variable in large enzyme families, but the reaction mechanisms for enzymes with the same first three digits are usually essentially the same. In total there are 224 EC sub-subclasses, with only 181 having known structures (12). Of these MACiE covers 158, i.e. 87%. However, there are probably many more mechanisms that are yet to be defined or discovered.
As can be seen from Figure 1, MACiE covers a good proportion of the EC reaction space, with an average relative difference between the size of corresponding EC classes of 4%, with the transferases having the largest difference. When the coverage with respect to EC code present in the PDB is examined, it can be seen that MACiE again represents the coverage of enzymes with known structures very well, with an average relative difference between the corresponding EC classes in MACiE of 5%.
All entries in MACiE contain overall reaction annotation including the information detailed in Table 1. Each elementary reaction or step within an entry is fully annotated as detailed in Figure 2; this includes comments that have been added by the annotators. An extension of the content from MACiE Version 1 is the addition of inferred return steps. These are explicitly labelled as being inferred in the comment field and are necessary to return the enzyme to a state where it is ready to undergo another round of catalysis.

Table 1. Overall reaction annotation content

Catalysis and reaction specific information         Non-catalysis specific information
Enzyme name (common IUPAB/JCBN name)                PDB code
EC code                                             Non-catalytic domain CATH code
Catalytic residues involved                         Non-catalytic UniProt code
Cofactors involved                                  Species name (common and scientific)
Reactants and products                              Other database identifiers, e.g. EzCatDB, SFLD, etc.
Catalytic domain CATH code                          Literature references
Catalytic UniProt code
Bonds involved, formed, cleaved, changed in order
Reactive centres
Overall reaction comments

There is sometimes more than one proposed mechanism that is consistent with the available experimental data. In MACiE, we have attempted not only to choose the best supported mechanism, but also where possible to annotate enzymes with reasonable alternative mechanisms. Unfortunately, in the current release such annotations are only available as comments on the stage or overall reaction, although future releases of MACiE will include full entries for these alternatives.
Further details of the annotation process and a glossary of terms used can be found on the MACiE website (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/documentation/ and http://www.ebi.ac.uk/thornton-srv/databases/MACiE/glossary.html, respectively).

DATABASE STRUCTURE
The challenge with MACiE has been to capture and usefully represent all the different catalytic steps that occur during the course of an enzymatic reaction. These reactions may consist of any number of steps, and in MACiE we have reactions ranging from 1 step to 16 steps. The representation of these reactions has evolved from a flat file entered in a commercially available chemical database program (ISIS/Base) to the highly structured and powerful CMLReact (7), which is an application of XML (the eXtensible Markup Language). The final step in this evolution has been the conversion of the CMLReact into the relational database format of MySQL.
CMLReact has a hierarchical structure, facilitating its conversion into the relational database format of MySQL. The conversion relies on the CML Schema and requires the MACiE entries to be consistent with the Schema, which adds an internal consistency check into our authoring process.
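One common way to flatten a hierarchical XML document into relational tables is to give each element type its own table, store each element as a row with its attributes as columns, and record which row of which table is the element's parent. The following is a minimal sketch of that idea, not the MACiE conversion code: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, the XML fragment and its element and attribute names are hypothetical, and `ensure_table` and `load_tree` are illustrative helpers.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical CML-like fragment; tag and attribute names are illustrative only.
SAMPLE = """
<reaction id="r1">
  <reactantList id="rl1">
    <molecule id="m1" title="water"/>
    <molecule id="m2" title="substrate"/>
  </reactantList>
  <productList id="pl1">
    <molecule id="m3" title="product"/>
  </productList>
</reaction>
"""


def ensure_table(conn, table, columns):
    """Create the table for a tag type if needed and add any missing columns."""
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (row_id INTEGER PRIMARY KEY)')
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    for col in columns:
        if col not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{col}"')


def load_tree(conn, root):
    """Store every element as one row in the table named after its tag."""
    def visit(elem, parent_table, parent_row):
        attrs = sorted(elem.attrib)
        # Two extra columns point back to the parent element's table and row.
        columns = ["parent_table", "parent_row"] + attrs
        ensure_table(conn, elem.tag, columns)
        values = [parent_table, parent_row] + [elem.attrib[a] for a in attrs]
        col_sql = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        cur = conn.execute(
            f'INSERT INTO "{elem.tag}" ({col_sql}) VALUES ({marks})', values
        )
        for child in elem:
            visit(child, elem.tag, cur.lastrowid)

    visit(root, None, None)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_tree(conn, ET.fromstring(SAMPLE.strip()))
    for row in conn.execute('SELECT * FROM "molecule" ORDER BY row_id'):
        print(row)
```

Because every row carries a pointer to its parent, the original tree can be rebuilt from the tables, while ordinary SQL queries can be run over any one element type.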


Figure 2. An example of the annotation found in a MACiE entry. Reaction shown corresponds to fructose-bisphosphate aldolase (entry 52).

Each CML tag-type becomes an MySQL table; each tag becomes a row in that MySQL table; each attribute of that tag corresponds to a column in the MySQL table. The tree structure of the CML is preserved in the MySQL version; for each row of each table, there are columns specifying which row of which other table corresponds to the row’s parent tag in the CML version.
The CML version of MACiE, which is the official archive version, is available from the website as individual entries, and the new website uses the relational version of MACiE to perform the online analysis and searching.

Table 2. Searches available in MACiE

Basic                             Complex
MACiE entry identifier            Species name (overall annotation)
Current EC codes                  Overall reactants and products
Obsolete EC codes                 Reaction comments (overall reactions and steps)
Catalytic Domain CATH codes       Amino acid residues (up to six residues)
All CATH codes                    Step mechanisms and/or mechanism components (single and combinations of)
PDB code                          Chemical changes
Enzyme name                       Chemical changes with mechanism or mechanism components
Catalytic Domain UniProt Codes    Chemical changes with amino acid residues
All UniProt Codes                 Amino acid residues with mechanism or mechanism components
                                  Chemical changes with amino acid residues and mechanisms or mechanism components
                                  Alternative mechanisms

Figure 3. EC code search heuristics.

DATABASE FEATURES
The original release of MACiE contained static images and annotation for the overall reaction and each step associated with the mechanism; it also included an animated reaction mechanism for approximately half the reactions then in MACiE. Links to various related resources, such as the RCSB PDB (13), IUBMB nomenclature database, CATH, EzCatDB, PDBSum (14), BRENDA, the Catalytic Site Atlas (15), KEGG and the Enzyme Structures Database, were also included. This new release extends these links to include the Macromolecular Structures Database (MSD) (16), SFLD, UniProt (17), and replaces the IUBMB nomenclature database links with links to IntEnz. The new features in MACiE are detailed in the following sections.

Searching MACiE
There are two levels of search implemented in MACiE. The basic level searches are implemented from the main page (http://www.ebi.ac.uk/thornton-srv/databases/MACiE) and are



mainly for accessing the entries from the top level, i.e. for searching entries in MACiE by EC code, enzyme name, etc. The complex searches are all available from the query pages of MACiE (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/queryMACiE.html) and are mainly for searching for specific mechanisms, mechanism components or residues and their functions in the reaction steps, although there are some overall reaction searches implemented as well. Table 2 lists the searches available in MACiE and the Supplementary Data contain a detailed listing of the searches available.
The following sections describe searching by EC code, PDB code or enzyme name, all of which use heuristics to extend the coverage of MACiE.

EC code. The EC code search implemented in MACiE is detailed in Figure 3 and can be accessed at any point in the scheme shown. The search for current EC numbers will always walk up the EC code tree until it finds a match, no matter at what level the search is entered. Thus the search will always return a result. As the EC code of enzymes may change over time, a search for obsolete EC codes has also been implemented, although this search will not always return a result. However, it should be noted that the higher up the EC hierarchy the search has gone, the less likely it is that the returned mechanism will be a match to the query. The obsolete EC code search works in the same way as the current EC code.
If no matches are found at the serial number level of the EC code, an advanced search option will allow the user to search for a structural homologue of an enzyme with a given EC code, which is shown in Figure 4 and described below. This advanced search option takes the entered EC code and finds the PDB codes of all of the matches to that EC code in the Catalytic Site Atlas (CSA). A homology search is then performed on those PDB codes for a match in MACiE. This homology search is described in more detail in the following section.
The CSA is a database of catalytic residues in proteins of known structure. It contains much less mechanistic information than MACiE, but has a considerably wider coverage of protein structures than MACiE does. This wider coverage is partly because the CSA contains not only manually annotated entries, but also contains entries that are automatically annotated based on sequence alignment to the manual entries.

Figure 4. Advanced EC search heuristics.
Figure 5. PDB search heuristics.
Figure 6. Enzyme name search heuristics.

PDB code. There are over 19 000 crystal structures relating to enzymes deposited in the PDB. As MACiE entries require extensive literature searching and analysis, only a small fraction of these PDB entries are covered explicitly, 202 in total. However, we have used the CSA to identify homologues of these enzymes, extending this coverage to 7528 PDB codes. Figure 5 details the search performed in MACiE, when a protein structure described by a PDB code is entered. Although the entries returned by this search will be homologues, this does not guarantee that the mechanism and the catalytic residue assignments are the same. This is because the homology method (see below) can retrieve very distant relatives. Owing to this limitation, all homologous entries are compared by EC code, and when there is a divergence between the MACiE entry and the homologue at the serial number level, this is clearly indicated to the user. We also


Nucleic Acids Research, 2007, Vol. 35, Database issue D519

list the amino acid residues that are annotated as catalytic in both MACiE and the CSA. Thus it is clear if there is any difference between EC numbers and catalytic residues. If the EC number differs but the catalytic residues between query and homologue are of identical types, it can be inferred that the mechanisms are likely to be the same, but where both differ, the mechanisms are unlikely to be transferable. From the results page we link both to the MACiE entry and the CSA entry.

Homology in MACiE. We have been working to bring MACiE and the CSA closer together. This includes using the CSA to determine homologues (those enzymes which are evolutionarily related) of entries in MACiE. The CSA
finds homologues using a PSI-BLAST search (with an E-value cut-off of 0.0005 and five iterations) against all sequences currently in the PDB, plus all sequences in a non-redundant subset of UniProt. The UniProt sequences are included purely in order to increase the range of the PSI-BLAST search by bridging gaps between distantly related sequences in the PDB; only sequences occurring in the PDB are retrieved for entry into the CSA. In the CSA, and thus MACiE, homologous entries are only included if the residues which align with the catalytic residues in the parent literature entry are identical in residue type. In other words, there must be no mutations at the catalytic residue positions. There are, however, a few exceptions to this rule:
(i) In order to allow for the many active site mutants in the
PDB, one (and only one) catalytic residue per site can be
different in type from the equivalent in the parent
literature entry. This is only permissible if all residue
spacing is identical to that in the parent literature entry,
and there are at least two catalytic residues.
(ii) Sites with only one catalytic residue are permitted to be
mutant provided that the residue number is identical to
that in the parent entry.
(iii) Fuzzy matching of residues is permitted within the
following groups: [V,L,I], [F,W,Y], [S,T], [D,E], [K,R],
[D,N], [E,Q], [N,Q]. This fuzzy matching cannot be used
in combination with rules (i) or (ii) above.

Figure 7. Growth of MACiE. This shows the growth in the number of EC codes (blue), EC sub-sub classes (cyan) and catalytic domain CATH codes (red) in MACiE.
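The acceptance rule for homologues and its exceptions (i) to (iii) can be sketched in code. The following is one possible reading of the rules, not the CSA implementation: the function names, the representation of a site as aligned (residue number, one-letter code) pairs, and the interpretation of "residue spacing" as the separation between consecutive residue numbers are all assumptions made for illustration.

```python
# Hypothetical sketch: each site is a list of (residue_number, one_letter_code)
# tuples, aligned so that position i in the homologue corresponds to
# position i in the parent literature entry.

FUZZY_GROUPS = [set("VLI"), set("FWY"), set("ST"), set("DE"),
                set("KR"), set("DN"), set("EQ"), set("NQ")]


def fuzzy_match(a, b):
    """True if both residue types belong to one of the fuzzy groups."""
    return any(a in group and b in group for group in FUZZY_GROUPS)


def spacing(site):
    """Separations between consecutive catalytic residue numbers (assumed meaning)."""
    numbers = [number for number, _ in site]
    return [later - earlier for earlier, later in zip(numbers, numbers[1:])]


def accept_homologue(parent, homologue):
    """Decide whether a homologous site is accepted under the rules above."""
    if not parent or len(parent) != len(homologue):
        return False
    differing = [i for i, (p, h) in enumerate(zip(parent, homologue))
                 if p[1] != h[1]]
    if not differing:
        return True  # default rule: all catalytic residue types identical
    # Rule (iii): every difference is a fuzzy match; not combined with (i)/(ii).
    if all(fuzzy_match(parent[i][1], homologue[i][1]) for i in differing):
        return True
    if len(differing) != 1:
        return False  # at most one true mutant residue per site
    if len(parent) >= 2:
        # Rule (i): one mutant allowed if all residue spacing is unchanged.
        return spacing(parent) == spacing(homologue)
    # Rule (ii): single-residue site, residue number must be identical.
    return parent[0][0] == homologue[0][0]
```

For example, under these assumptions a parent site `[(10, "D"), (50, "H"), (90, "S")]` would accept a homologue `[(12, "E"), (52, "H"), (92, "S")]` through the [D,E] fuzzy group, but would reject one that combines a fuzzy substitution with a second, non-fuzzy mutation.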

Figure 8. Frequency distribution of amino acid residues. This shows the frequency of catalytic amino acid residues in MACiE (blue), versus the frequency of
residues in MACiE (cyan), versus the frequency of residues in the wwPDB (red). The frequency of catalytic amino acid residues in MACiE is calculated by
taking the number of residues (of a given type) annotated in MACiE divided by the total number of annotated residues in MACiE, multiplied by 100.
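The percentage defined in the caption of Figure 8 can be written as a short calculation. This is a minimal sketch; the function name and the input format (a flat list with one residue type per annotated catalytic residue) are assumptions, and the residue list in the example is made up.

```python
from collections import Counter


def catalytic_residue_frequencies(annotated_residues):
    """Percentage of annotated catalytic residues that are of each type."""
    counts = Counter(annotated_residues)
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}


# Made-up example: four annotated catalytic residues.
frequencies = catalytic_residue_frequencies(["His", "Asp", "His", "Ser"])
```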



Enzyme name. This is currently implemented as a partial string match, thus entering ‘beta’ will return all the β-lactamases and betaine-aldehyde dehydrogenase. If no results are returned from the partial name search, then the name search heuristics (shown in Figure 6) are implemented. This search utilizes the IntEnz database (4). MACiE searches for a name in IntEnz, either a synonym, alternative name or common name, and returns the EC code of that name. The EC code is then used to search MACiE. If no matches are found to the sub-subclass level of the EC code, the user is offered an advanced EC code search (see Figure 4).

Statistics
The other major development in MACiE has been the inclusion of database statistics that are all generated on the fly from the SQL tables. A full listing of the statistics available can be found in the Supplementary Data. The growth of MACiE is shown in Figure 7 in terms of EC coverage and CATH coverage.
The statistics in MACiE can also be used to examine the function and distribution of amino acid residues (G.L. Holliday, D.E. Almonacid, J.M. Thornton and J.B.O. Mitchell, manuscript in preparation) (see Figure 8), the distribution of mechanism and mechanism components and the bond order changes occurring in each step of the reaction.

FUTURE DEVELOPMENTS
MACiE is a continually developing resource, and in the future we hope to include 3D data, which will incorporate various statistics and searches related to the analysis of these data. We will also continue to extend the coverage of MACiE to include alternative reaction mechanisms that have been suggested for various enzymes, as well as new mechanisms. Finally, we intend to build a user interface which will allow for chemical diagrams to be drawn and used to search MACiE, an entry process which is more usable and also to implement the classification of enzyme mechanisms that we are developing.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS
We would like to thank the EPSRC (G.L.H. and J.B.O.M.), BBSRC (G.J.B. and J.M.T., CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M., grant BB/C51320X/1), the Wellcome Trust, EMBL, IBM (G.L.H. and J.M.T.), the Chilean Government’s Ministerio de Planificación y Cooperación and the Cambridge Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science Informatics. J.W.T. is funded by a European Molecular Biology Laboratory studentship, and is also affiliated with Cambridge University Department of Chemistry. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.

Conflict of interest statement. None declared.

REFERENCES
1. Schomburg,I., Chang,A., Ebeling,C., Gremse,M., Heldt,C., Huhn,G. and Schomburg,D. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431–D433.
2. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
3. IUBMB (2005) Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzyme-catalysed reactions.
4. Fleischmann,A., Darsow,M., Degtyarenko,K., Fleischmann,W., Boyce,S., Axelsen,K., Bairoch,A., Schomburg,D., Tipton,K.F. and Apweiler,R. (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res., 32, D434–D437.
5. Holliday,G.L., Bartlett,G.J., Almonacid,D.E., O’Boyle,N.M., Murray-Rust,P., Thornton,J.M. and Mitchell,J.B.O. (2005) MACiE: a database of enzyme reaction mechanisms. Bioinformatics, 21, 4315–4316.
6. Holliday,G.L., Mitchell,J.B.O. and Murray-Rust,P. (2004) CMLSnap: animated reaction mechanisms. Internet J. Chem., 7, Article 4.
7. Holliday,G.L., Murray-Rust,P. and Rzepa,H.S. (2006) Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML vocabulary for chemical reactions. J. Chem. Inf. Model., 46, 145–157.
8. Pegg,S.C.-H., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C., Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C. (2006) Leveraging enzyme structure–function relationships for functional inference and experimental design: the Structure–Function Linkage Database. Biochemistry, 45, 2545–2555.
9. Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism DataBase. Nucleic Acids Res., 33, D407–D412.
10. Berman,H.M., Henrick,K. and Nakamura,H. (2003) Announcing the worldwide Protein Data Bank. Nature Struct. Biol., 10, 980.
11. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH - a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
12. Martin,A.C. (2004) PDBSprotEC: a Web-accessible database linking PDB chains to EC numbers via SwissProt. Bioinformatics, 20, 986–988.
13. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
14. Laskowski,R.A., Chistyakov,V.V. and Thornton,J.M. (2005) PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res., 33, D266–D268.
15. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.
16. Golovin,A., Oldfield,T.J., Tate,J.G., Velankar,S., Barton,G.J., Boutselakis,H., Dimitropoulos,D., Fillon,J., Hussain,A., Ionides,J.M. et al. (2004) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res., 32, D211–D216.
17. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.


Vol. 22 no. 20 2006, pages 2565–2566
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btl416

Data and text mining

PYCHEM: a multivariate analysis package for python
Roger M. Jarvis1,4,*, David Broadhurst1,4, Helen Johnson2, Noel M. O’Boyle3 and
Royston Goodacre1,4
1School of Chemistry, The University of Manchester, PO Box 88, Sackville Street, Manchester M60 1QD, UK,
2Faculty of Life Sciences, University of Manchester, Stopford Building, Oxford Road, Manchester M13 9PT, UK,
3Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield
Road, CB2 1EW, UK and 4Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK
Received on April 4, 2006; revised on July 5, 2006; accepted on July 26, 2006
Advance Access publication July 31, 2006
Associate Editor: Martin Bishop

ABSTRACT
Summary: We have implemented a multivariate statistical analysis toolbox, with an optional standalone graphical user interface (GUI), using the Python scripting language. This is a free and open source project that addresses the need for a multivariate analysis toolbox in Python. Although the functionality provided does not cover the full range of multivariate tools that are available, it has a broad complement of methods that are widely used in the biological sciences. In contrast to tools like MATLAB, PyChem 2.0.0 is easily accessible and free, allows for rapid extension using a range of Python modules and is part of the growing amount of complementary and interoperable scientific software in Python based upon SciPy. One of the attractions of PyChem is that it is an open source project and so there is an opportunity, through collaboration, to increase the scope of the software and to continually evolve a user-friendly platform that has applicability across a wide range of analytical and post-genomic disciplines.
Availability: http://sourceforge.net/projects/pychem
Contact: Roger.Jarvis@manchester.ac.uk or admin@pychem.org.uk
Supplementary information: Further information is available from the project home page at http://pychem.sf.net/ whilst details of data generation are available at http://biospec.net/

1 INTRODUCTION
Increasingly in the life sciences many experiments generate data which are of a multivariate nature, where many observations are recorded for each sample under analysis. Interpretation of such complex data cannot generally be performed by taking a univariate approach, since no single measurement is necessarily adequate enough to describe the problem being addressed. In fact, the application of univariate methodology is in many cases totally inappropriate as the complexity of information contained within large biological datasets reflects the complexity of the system(s) being studied. Typical multivariate analysis problems involve unsupervised learning such as factor analysis, for reducing the dimensionality of data and modeling of variance; linear regression, for formulating input to output transformation models based on supervised learning which are predictive generally for quantitative trait(s); and discriminant analysis, for distinguishing between different sample groups and for subsequent predictions on new samples. In fact, multivariate analysis encompasses many more methods than these examples of linear modeling imply (Brereton, 2003); but these tools are perhaps those most commonly used for the modeling of biological data.
Many programs currently exist for multivariate analysis. Flexible environments for mathematical computing are available in the form of MATLAB (The Mathworks, Natick, MA, USA), GNU Octave (http://www.octave.org/) which aims to be a free equivalent of MATLAB, and R (http://www.r-project.org/), which has many other bio-analysis modules, such as Vegan (for environmetrics) and Bioconductor (for genomic analysis). These products provide powerful tools for multivariate analysis through command line interpreters, which allow the user to perform their analysis with a great degree of flexibility. However, they require some investment in time to become familiar with the interpreter's syntax, and are not necessarily straightforward for people with little computational experience. In addition, a number of graphical multivariate software tools are also available; Evince (UmBio, Umeå, Sweden), The Unscrambler (CAMO, Woodbridge, NJ, USA), Pirouette (Infometrix, Bothell, WA, USA), S-Plus (Insightful, Seattle, WA, USA) and SIMCA (Umetrics, Umeå, Sweden) are all good tools for basic multivariate analysis although, with the exception of S-Plus, they lack the flexibility of the interpreter style interfaces.
Thus there is currently a requirement for a flexible, extensible, free and open source graphical environment for performing multivariate analysis, which can be used by both experts and casual users. The increasing popularity of scripting languages such as Python (http://www.python.org/) within the life sciences community offers the technology and critical mass for such a project. A platform of this type addresses the requirements outlined above, with the additional benefit that it allows for the rapid development of new cross-platform software approaches, and the integration of currently available software libraries through application programming interfaces (APIs).

2 THE MULTIVARIATE ANALYSIS TOOLBOX FOR PYTHON
The PyChem project aims to provide a simple multivariate analysis toolbox with a powerful and intuitive GUI front-end.

*To whom correspondence should be addressed.

© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Bioinformatics. 2006, 22, 2565-2566.

R.M.Jarvis et al.

Fig. 1. A screenshot demonstrating the feature selection functionality available in PyChem, in this example microarray data (Golub et al., 1999) have been analysed consisting of 72 samples represented by 7070 genes. The GA directed search can be used to highlight genes that are particularly important for discrimination.

The project is implemented in Python and utilizes the wxPython (http://www.wxpython.org/), Boa Constructor (http://boa-constructor.sourceforge.net/) and SciPy (http://scipy.org/) packages (see Fig. 1 for an example screenshot) amongst others. The software was designed to provide a range of algorithms that address three fundamental questions commonly asked by the researcher.

(1) What is the shape of the data, including sources of variance and outlier identification?
(2) How similar are different samples?
(3) Which measurements from the original data can be attributed to observed differences and/or similarities?

To help answer these questions, the initial release includes algorithms for the pre-processing of multivariate data (such as scaling, baseline correction, filtering and derivatization), principal components analysis (PCA) (Jolliffe, 1986), partial least squares regression (PLS1) (Martens and Naes, 1989), discriminant function analysis (DFA) (Manly, 1994), cluster analysis [using the C clustering library for Python (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/) (Eisen et al., 1998; de Hoon et al., 2004)], and a number of genetic algorithm (GA) based tools for performing feature selection (Jarvis and Goodacre, 2005), see Fig. 1.
The software is able to handle any 2D dataset where each sample is defined by a series of discrete or continuous measurements. Data can be imported from flat ASCII files that use the standard delimiters. Typical data of this type include those generated from microarrays, proteomics, spectroscopic methods (UV-Vis, infrared and Raman), mass spectrometry, NMR, or indeed any data arrays representing samples for which multiple discrete measurements have been acquired. Once data have been imported into PyChem they can be saved in an XML format [implemented using cElementTree (http://effbot.org/)] as a PyChem experiment, which allows for the subsequent storage of multiple experimental results within a single file. This allows for the capture of the state of the system at a point in time, so that results of multivariate analyses can be stored and the progression of the analysis recorded, which is particularly useful for tracking data analyses as part of GLP. The additional benefit of using the XML data structure for storage is that it introduces the potential for engineering simple bespoke interfaces to database storage systems.
PyChem provides simple grid-style user interfaces for the input of experimental and sample metadata, so producing a series of vectors describing the origin and identity of each sample and measured variable. For unsupervised analyses, such as PCA, the software simply requires a vector, or multiple vectors of sample labels for plotting; in addition for supervised analyses, vectors are required to (1) represent putative class structures or some quantitative trait (e.g., level of abiotic or biotic interference) and (2) identify groups into which the data should be split for the purpose of cross-validation. In supervised analyses the issue of model validation is crucial; when a model is formulated there is a possibility that it will overfit the data and find a relationship between the data and the target class structure or dependent variables, which does not hold for subsequent predictions; i.e. the model has learnt the training data perfectly and is not able to generalize. This situation can be avoided by performing some form of model validation. In the current version of PyChem (2.0.0) we use the preferred approach of data splitting (Brereton, 2003), which works by dividing the measured X-variables into three groups; a model training set, model cross-validation data and finally an independent test set. The model is trained on the first set, optimized on the second set and then tested for accuracy on the third set of ‘hold-out’ data.
A major emphasis of this work has been in providing clear and useful graphical reports for the interpretation of results. The GUI uses wxPyPlot (http://www.cyberus.ca/~g_will/wxPython/wxpyplot.html), with a small modification to include text plotting. In the future even more focus will be given to the structure of graphical reporting in PyChem, as well as the functionality associated with the plotting canvases. Finally, all results, both graphical and numerical, can easily be exported from PyChem, with numerical results in ASCII file format to allow for use in other software applications.

ACKNOWLEDGEMENTS
R.M.J., D.B., H.J., N.M.O.B. and R.G. would like to thank the BBSRC for funding (NMOB; grant BB/C51320X/1). Funding to pay the Open Access publication charges for this article was provided by the BBSRC.

Conflict of Interest: none declared.

REFERENCES
Brereton,R. (2003) Chemometrics: data analysis for the laboratory and chemical plant, 1st edn. Chichester: John Wiley & Sons Ltd.
Eisen,M. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
de Hoon,M. et al. (2004) Open source clustering software. Bioinformatics, 20, 1453–1454.
Golub,T. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Jarvis,R. and Goodacre,R. (2005) Genetic algorithm optimization for pre-processing and variable selection of spectroscopic data. Bioinformatics, 21, 860–868.
Jolliffe,I.T. (1986) Principal Component Analysis. Springer-Verlag, New York.
Manly,B.F.J. (1994) Multivariate Statistical Methods: A Primer. Chapman & Hall/CRC, New York.
Martens,H. and Naes,T. (1989) Multivariate Calibration. John Wiley & Sons, Chichester.


Methodology Open Access
Simultaneous feature selection and parameter optimisation using
an artificial ant colony: case study of melting point prediction
Noel M O'Boyle*1,2, David S Palmer1,3, Florian Nigsch1 and
John BO Mitchell1

Address: 1Unilever Centre for Molecular Science Informatics, Dept. of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW,
UK, 2Cambridge Crystallographic Data Centre, 12 Union Rd, Cambridge, CB2 1EZ, UK and 3Department of Chemistry, Aarhus University, 8000
Aarhus C, Denmark
Email: Noel M O'Boyle* - baoilleach@gmail.com; David S Palmer - dsp@chem.au.dk; Florian Nigsch - fn211@cam.ac.uk;
John BO Mitchell - jbom1@cam.ac.uk

Published: 29 October 2008 Received: 1 August 2008
Accepted: 29 October 2008

Abstract
Background: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony
(WAAC), that performs simultaneous feature selection and model parameter optimisation for the
development of predictive quantitative structure-property relationship (QSPR) models. The
WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf
Model 2005, 45: 1024–1029). We test the ability of the algorithm to develop a predictive partial
least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581–590) of melting
point values. We also test its ability to perform feature selection on a support vector machine
model for the same dataset.
Results: Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model
with 68 descriptors which has an RMSE on an external test set of 46.6°C and R² of 0.51. The
number of components chosen for the model was 49, which was close to optimal for this feature
selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C
and R² of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R² of 0.47) for the same
data and has similar performance to a Random Forest model (RMSE of 44.5°C, R² of 0.55).
However it is much less prone to bias at the extremes of the range of melting points as shown by
the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest.
Conclusion: With a careful choice of objective function, the WAAC algorithm can be used to
optimise machine learning and regression models that suffer from overfitting. Where model
parameters also need to be tuned, as is the case with support vector machine and partial least
squares models, it can optimise these simultaneously. The moving probabilities used by the
algorithm are easily interpreted in terms of the best and current models of the ants, and the
winnowing procedure promotes the removal of irrelevant descriptors.
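
The three quantities quoted in the Results paragraph (RMSE, R2 and the slope of the line through the residuals) could be computed along the following lines. This is a hedged sketch rather than the authors' code: the function name is hypothetical, R2 is assumed here to be the squared Pearson correlation between observed and predicted values, and residuals are assumed to be defined as predicted minus observed and regressed against the observed values.

```python
import math


def regression_summary(observed, predicted):
    """Return (rmse, r2, residual_slope) for paired observed/predicted values."""
    n = len(observed)
    # Assumption: residual = predicted - observed.
    residuals = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / n)

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    mean_r = sum(residuals) / n
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))

    # Assumption: R2 taken as the squared Pearson correlation coefficient.
    r2 = s_op ** 2 / (s_oo * s_pp)
    # Least-squares slope of residuals regressed on observed values.
    slope = sum((o - mean_o) * (r - mean_r)
                for o, r in zip(observed, residuals)) / s_oo
    return rmse, r2, slope


# Toy data: low values over-predicted, high values under-predicted.
rmse, r2, slope = regression_summary([0.0, 100.0, 200.0], [20.0, 100.0, 180.0])
```

Under these conventions a negative slope indicates that predictions are pulled towards the middle of the range, which is the kind of bias at the extremes that the abstract refers to.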



Background
Quantitative Structure-Activity and Structure-Property Relationship (QSAR and QSPR) models are based upon the idea, first proposed by Hansch [1], that a molecular property can be related to physicochemical descriptors of the molecule. A QSAR model for prediction must be able to generalise well to give accurate predictions on unseen test data. Although it is true in general that the more descriptors used to build a model, the better the model predicts the training set data, such a model typically has very poor predictive ability when presented with unseen test data, a phenomenon known as overfitting [2]. Feature selection refers to the problem of selecting a subset of the descriptors which can be used to build a model with optimal predictive ability [3]. In addition to better prediction, the identification of relevant descriptors can give insight into the factors affecting the property of interest.

The number of subsets of a set of n descriptors is 2^n - 1. Unless n is small (<20) it is not feasible to test every possible subset, and the number of descriptors calculated by cheminformatics software is usually much larger (CDK [4], MOE [5] and Sybyl [6] can respectively calculate a total of 95, 146 and 248 1D and 2D descriptors). Feature selection methods can be divided into two main classes: the filter approach and the wrapper approach [3,7,8]. The filter approach does not take into account the particular model being used for prediction, but rather attempts to determine a priori which descriptors are likely to contain useful information. Examples of this approach include ranking descriptors by their correlation with the target value or by estimates of the mutual information (based on information theory) between each descriptor and the response. Another commonly used filter in QSAR is the removal of highly correlated (or anti-correlated) descriptors [9]. Liu [10] presents a comparison of five different filters in the context of prediction of binding affinities to thrombin. The filter approach has the advantages of speed and simplicity, but the disadvantage that it does not explicitly consider the performance of the model containing different features. Correlation criteria can only detect linear dependencies between descriptor values and the response, but the best performing QSAR models are often non-linear (support vector machines (SVM), neural networks (NN) and random forests (RF), for example). In addition, Guyon and Elisseeff show that very high correlation (or anti-correlation) does not necessarily imply an absence of feature complementarity, and also that two variables that are useless by themselves can be useful together [3].

The wrapper approach conducts a search for a good feature selection using the induction algorithm as a black box to evaluate subsets and calculate the value of an objective function. The objective function should provide an estimate of how well the model will generalise to unseen data drawn from the same distribution. The purpose of the search is to find the feature selection that optimises this value. The most well-known deterministic wrapper is sequential forward selection [11] (SFS) which involves successive additions of the feature that most improves the objective function to the subset of descriptors already chosen. A related algorithm, sequential backwards elimination [12] (SBE), successively eliminates descriptors starting from the complete set of descriptors. Both of these algorithms suffer from the problem of 'nesting'. In the case of SFS, nesting refers to the fact that once a particular feature is added it cannot be removed at a later stage, even if this would increase the value of the objective function. More sophisticated methods, such as the sequential forward floating selection (SFFS) algorithm of Pudil et al. [13], include a backtracking phase after each addition where variables are successively eliminated if this improves the objective function. Wrapper methods specific to certain models have also been developed. For example, the Recursive Feature Elimination algorithm of Guyon et al. [14] and the Incremental Regularised Risk Minimisation of Fröhlich et al. [15] are specific to models built using support vector machines.

Stochastic wrappers attempt to deal with the size of the search space by incorporating some degree of randomness into the search strategy. The most well known of these algorithms is the genetic algorithm [16] (GA), whose search procedure mimics the biological process of evolution. A number of models are created randomly in the first generation, the best of which (as measured by the objective function) are selected and interbred in some way to create the next generation. A mutation operator is applied to the new models so that random sampling of the local space occurs. Over the course of many generations, the objective function is optimised. Genetic algorithms were first used for feature selection in QSAR by Rogers and Hopfinger [17] and are now used widely [9,18,19]. Other stochastic methods which have been used for feature selection in QSAR are particle swarm optimisation [20,21] and simulated annealing [22].

An additional difficulty in the development of QSAR models is the fact that some regression methods have parameters that need to be optimised to obtain the best performance for a particular problem. The Support Vector Machine (SVM) is an example of such a method. A SVM is a kernel-based machine learning method used for both classification and regression [23-25] which has shown very good performance in QSAR studies [9]. In ε-SVM regression, the algorithm finds a hyperplane in a transformed space of the inputs that has at most ε deviation from the output y values. Deviations greater than ε are penalised by multiplying by a cost value C. The transformation of the inputs is carried out by means of kernel functions, which allows nonlinear relationships between the inputs and the outputs to be handled by this essentially linear method. For a particular problem and kernel, the values of C and ε must be tuned.

Here we describe WAAC, Winnowing Artificial Ant Colony, a stochastic wrapper for feature selection and parameter optimisation that combines simultaneous optimisation of the selected descriptors and the model parameters to create a model with good predictive accuracy. This method does not require any pre-processing of the data apart from removal of zero-variance and duplicate descriptors. The only requirement is that allowed values of parameters of the models must be specified. As a result, this method is suitable for use as an automatic generator of predictive models.

The WAAC algorithm is a novel stochastic wrapper derived from the modified Ant Colony Optimisation (ACO) algorithm of Shen et al. [26]. Ant colony algorithms take their inspiration from the foraging of ants whose cooperative behaviour enables the shortest path between nest and food to be found [27]. Ants deposit a substance called pheromone as they walk, thus forming a pheromone trail. At a branching point, an ant is more likely to choose the trail with the greater amount of pheromone. Over time as pheromones evaporate, only those trails that have been reinforced by the passage of many ants will retain appreciable amounts of pheromone, with the shortest trail having the greatest amount of pheromone. In the end, all of the ants will travel by the shortest trail. Artificial ant colony systems may be used to solve combinatorial optimisation problems by making use of the ideas of cooperation between autonomous agents through global knowledge and positive feedback that are observed in real ant colonies [28].

The first use of artificial ant systems for variable selection in QSAR was the ANTSELECT algorithm of Izrailev and Agrafiotis [29]. The ANTSELECT algorithm involves the

Since the ANTSELECT algorithm uses only a single ant, it cannot make use of one of the most important features of ant colony algorithms, collective intelligence. Instead, premature convergence will occur due to positive reinforcement of models that have performed well earlier in the local search. In addition, the search space will be poorly covered. Although the authors recommend that the algorithm should be repeated several times to minimise the likelihood of convergence to a poor local minimum, the use of an ant colony is a much more robust solution.

Shen et al. [26] presented an ACO algorithm that differed from ANTSELECT in several ways. Their algorithm, which they called a modified ACO, is similar to our WAAC algorithm in that it involves a colony of ants, each of which remembers its best model and score, as well as its current model and score. In Shen et al.'s algorithm, for every descriptor there are both positive and negative weights. The probability that an ant will choose a particular descriptor is given by the positive weight for that descriptor divided by the sum of the positive and negative weights. After every iteration, the weights are reduced by multiplying by (1-ρ) as for ANTSELECT. The positive weight for a particular descriptor is increased by the sum of the fitness scores of all ants in the current iteration that have selected it, as well as the fitness scores of the best models of all ants that have selected it in that model. Similarly, the negative weight for a particular descriptor is decreased by an amount based on the fitness scores of models that have not selected it.

In the following section, we describe the WAAC algorithm in detail, as well as the dataset and model used to test the algorithm. In the Results and Discussion sections, we describe the performance of the WAAC algorithm, compare it to other models on the same dataset, and discuss some practical considerations in usage.

Methods
WAAC algorithm
movement of a single ant through feature space. Initially The WAAC algorithm uses a population of candidate
equal weights are assigned to each descriptor. The proba- models termed an 'ant colony'. Each ant represents a
bility of the ant choosing a particular descriptor in the model; that is, it is associated with a particular feature
next iteration is the weight for that descriptor divided by selection as well as particular values for the model (for
the sum of all weights. After the fitness of the model is example, SVM) parameters. The set of descriptors is stored
assessed, all of the weights are reduced by multiplying by as a binary fingerprint of length F (the number of descrip-
(1-ρ), where ρ is the evaporation coefficient. The weights tors), where a value of 1 for the nth bit indicates that the
of those descriptors selected in the current iteration are nth descriptor is selected, and 0 indicates that it is not. For
then increased by a constant multiple of the fitness score. each parameter of the model, a range of discrete values is
Gunturi et al. [30] used a modification of the ANTSELECT required. The parameter values used by a particular ant are
algorithm in a recent study of human serum albumin stored in a list of length P, where P is the number of
binding affinity in which the number of features selected adjustable parameters of the model. The fitness of each
was fixed a priori and, in addition, could not include model is measured using an objective function specified
descriptors that had a correlation coefficient greater than by the user.
0.75.
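The ant representation described above (a binary fingerprint of length F plus a list of P parameter values) can be sketched as a small data structure. This is an illustrative Python sketch only; the authors' implementation is in R and is not reproduced here, and the names `Ant` and `selected_descriptors` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Ant:
    """One candidate model: a feature selection plus one value per parameter.

    Hypothetical structure for illustration; not the authors' R code.
    """
    fingerprint: List[int]  # length F; bit n is 1 if the nth descriptor is selected
    params: List[float]     # length P; one chosen value per adjustable parameter


def selected_descriptors(ant: Ant, names: Sequence[str]) -> List[str]:
    """Return the descriptor names whose bit is set in the ant's fingerprint."""
    if len(names) != len(ant.fingerprint):
        raise ValueError("fingerprint length must equal the number of descriptors F")
    return [name for name, bit in zip(names, ant.fingerprint) if bit == 1]


# Example with F = 4 descriptors and P = 2 parameters (an SVM's C and epsilon).
ant = Ant(fingerprint=[1, 0, 0, 1], params=[5.0, 0.21])
print(selected_descriptors(ant, ["a_nO", "b_rotR", "TPSA", "rgyr"]))
```

Keeping the fingerprint and the parameter list together means that a single object fully specifies the model to be built and scored by the objective function.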



The initial population of ants is randomly placed in feature and parameter space. The bits of the binary fingerprints representing the feature selections are initialised to either 0 or 1 with equal probability, so that on average each ant corresponds to a model based on approximately 50% of the descriptors. Conversely, each descriptor is initially selected by approximately 50% of the ants. The initial parameter values for each ant are chosen at random from the available values for each parameter.

Figure 1 shows a schematic of the WAAC algorithm. After initialisation, the algorithm enters the optimisation phase. For each descriptor, a moving probability is calculated by taking the average of the fraction of ants which have currently selected that descriptor and the fraction that have selected that descriptor in their best model. This moving probability is used to determine the chance that a particular ant will select a particular descriptor in the next iteration. At the start of the optimisation phase, the moving probabilities for all of the descriptors will be approximately equal to 0.5 (since the best model will be the current model and each descriptor is selected by approximately 50% of the ants).

Similarly, for each parameter there is a moving probability associated with every allowed value. These moving probabilities sum to unity (since each ant needs to select exactly one allowed value for each parameter), and are calculated by taking the average of the fraction of ants which have currently selected a particular allowed value and the fraction of ants that have selected that value in their best model. At the start of the optimisation phase, each allowed value of a parameter will be selected by approximately N/P ants where N is the number of ants, and P the number of allowed values.
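The moving probabilities described above can be written down directly. The following is a minimal Python sketch under the stated definitions (average of the "current" fraction and the "best" fraction); the function names are hypothetical and this is not the authors' R implementation.

```python
from typing import Dict, List, Sequence


def descriptor_moving_probabilities(current: List[List[int]],
                                    best: List[List[int]]) -> List[float]:
    """For each descriptor, average the fraction of ants selecting it in their
    current model and the fraction selecting it in their best model."""
    n_ants = len(current)
    probs = []
    for j in range(len(current[0])):
        frac_current = sum(fp[j] for fp in current) / n_ants
        frac_best = sum(fp[j] for fp in best) / n_ants
        probs.append(0.5 * (frac_current + frac_best))
    return probs


def parameter_moving_probabilities(current_values: Sequence[float],
                                   best_values: Sequence[float],
                                   allowed: Sequence[float]) -> Dict[float, float]:
    """Same averaging for one parameter; the result sums to 1 over allowed values."""
    n_ants = len(current_values)
    probs = {}
    for value in allowed:
        frac_current = sum(1 for v in current_values if v == value) / n_ants
        frac_best = sum(1 for v in best_values if v == value) / n_ants
        probs[value] = 0.5 * (frac_current + frac_best)
    return probs


# Four ants, two descriptors, one parameter with allowed values 1, 3 and 5.
current = [[1, 0], [1, 1], [0, 0], [1, 0]]
best = [[1, 0], [1, 0], [1, 1], [0, 0]]
desc_probs = descriptor_moving_probabilities(current, best)
param_probs = parameter_moving_probabilities([1, 3, 3, 5], [3, 3, 3, 5], [1, 3, 5])
print(desc_probs)   # [0.75, 0.25]
print(param_probs)  # {1: 0.125, 3: 0.625, 5: 0.25}
```

Because every ant holds exactly one value per parameter in both its current and best model, the parameter probabilities sum to one without any explicit normalisation.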

At the start of the optimisation phase, the ants move more or less randomly, as the moving probabilities are essentially equal for all features and parameter values. However, over the course of the optimisation phase as particular descriptors are found to occur frequently in the best models associated with the ants, due to positive feedback these descriptors will be more likely to be chosen in subsequent iterations. This global optimisation procedure is combined with local optimisation due to the influence of the current positions of the ants on the moving probabilities. Note that the ants do not move about relative to their position in a previous iteration; rather, their subsequent location in feature space is determined by the best and current feature selections of all of the ants. Note that nesting is not a problem, as in each step of the optimisation the ants are free to explore descriptor combinations which did not exist in the previous step.
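One iteration of the optimisation phase, as described above, might look like the following Python sketch. It covers feature selection only (parameter values are omitted for brevity), assumes that lower objective values are better (as with RMSE), and uses a toy objective in place of model building; all names are hypothetical.

```python
import random
from typing import Callable, List


def moving_probabilities(current: List[List[int]], best: List[List[int]]) -> List[float]:
    """Average of the 'current' and 'best' selection fractions for each descriptor."""
    n_ants = len(current)
    return [0.5 * (sum(fp[j] for fp in current) + sum(fp[j] for fp in best)) / n_ants
            for j in range(len(current[0]))]


def optimisation_step(current: List[List[int]], best: List[List[int]],
                      best_scores: List[float],
                      objective: Callable[[List[int]], float],
                      rng: random.Random) -> List[List[int]]:
    """Move every ant once; `best` and `best_scores` are updated in place."""
    probs = moving_probabilities(current, best)
    new_current = []
    for i in range(len(current)):
        # The new position is drawn from colony-wide probabilities,
        # not relative to the ant's own previous position.
        fingerprint = [1 if rng.random() < p else 0 for p in probs]
        score = objective(fingerprint)
        if score < best_scores[i]:
            best[i], best_scores[i] = fingerprint, score
        new_current.append(fingerprint)
    return new_current


# Toy objective (an assumption for the demo): mismatches against a hidden target.
rng = random.Random(0)
target = [1, 0, 1, 0, 0, 1]
objective = lambda fp: sum(a != b for a, b in zip(fp, target))
current = [[rng.randint(0, 1) for _ in target] for _ in range(20)]
best = [fp[:] for fp in current]
best_scores = [objective(fp) for fp in best]
start = min(best_scores)
for _ in range(30):
    current = optimisation_step(current, best, best_scores, objective, rng)
print(start, min(best_scores))
```

Each ant's remembered best score can only improve or stay the same, which is the source of the positive feedback on frequently chosen descriptors.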

After multiple iterations of the optimisation algorithm, a winnowing procedure is applied. This reduces the search space by retaining only those descriptors that have been chosen by at least 20% of the ants in their best models, and removing the rest. Parameter values are reinitialised randomly. Some descriptors may be retained that do not improve the models, but the subsequent reinitialisation of the ants on the smaller search space will allow the subsequent optimisation phase to identify better models which exclude those descriptors. Note that no information is carried from one optimisation procedure to the next. In particular, memory of previous best models does not guide future searching. This means that the randomly initialised models in the new optimisation phase are always poorer than the best models of the previous phase, but the reduction in the size of the feature space means that the performance of the model quickly recovers and matches or improves on earlier performance.

Figure 1: Outline of the WAAC algorithm.
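The winnowing rule described above (retain descriptors chosen by at least 20% of the ants in their best models) can be sketched as follows. This is an illustrative Python sketch with hypothetical names; the reinitialisation of the ants on the reduced space is not shown.

```python
from typing import List, Sequence


def winnow(best_fingerprints: List[List[int]],
           descriptor_names: Sequence[str],
           threshold: float = 0.2) -> List[str]:
    """Keep descriptors selected in the best models of at least `threshold` of the ants."""
    n_ants = len(best_fingerprints)
    kept = []
    for j, name in enumerate(descriptor_names):
        fraction = sum(fp[j] for fp in best_fingerprints) / n_ants
        if fraction >= threshold:
            kept.append(name)
    return kept


# Ten ants, three descriptors: A is in 8 best models, B in 1, C in 3.
best_fingerprints = [
    [1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 0, 0], [1, 0, 0],
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
]
kept = winnow(best_fingerprints, ["A", "B", "C"])
print(kept)  # ['A', 'C']
```

After winnowing, a fresh colony would be initialised at random on the retained descriptors only, so that no memory of earlier best models is carried over.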

As shown in Figure 1, the optimisation phase and winnowing procedure are repeated until convergence is achieved or a specific number of iterations have occurred. The best model found at any point in the entire optimisation procedure should be chosen as the final best model. An implementation of WAAC in R [31] is available from the authors on request.

Dataset
We use the Karthikeyan dataset [32] of melting point values as described in Nigsch et al. [33]. This is a dataset of melting points of 4119 diverse organic molecules which cover a range of melting points from 14 to 392.5°C, with a mean of 167.3°C and a standard deviation of 66.4°C. Each molecule is described by 203 2D and 3D descriptors, which is the full range of descriptors available in the software MOE 2004.03 [5].

The dataset was randomly divided 2:1 into training data and an external test set (1373 molecules, see additional file 1: externaltest.csv for the original data). The training data was further randomly divided 2:1 into a training set used for model building (1831 molecules, see additional file 2: internaltraining.csv) and an internal test set (915 molecules, see additional file 3: internaltest.csv).

Objective function
The goal of the WAAC algorithm is to find the feature subset and parameter values that will give the best predictive accuracy for a model based on given training data. During the course of the optimisation, the algorithm needs to be guided by an objective function that will give an estimate of the predictive accuracy of a particular model.

Here we examine the performance of the WAAC algorithm on the Karthikeyan dataset using as our objective function the root mean squared error of the predictions on the internal test set, RMSE(int). Each model is built on the training set using whatever features and parameter values have been selected, and then used to predict the melting point values for the internal test set.

Statistical testing
To assess the quality of a model, we report three statistics: the squared correlation coefficient, R², the Root-Mean-Square-Error, RMSE, and the bias. These are defined in Equations 1 to 3. A parenthesis nomenclature is used to indicate whether the statistic refers to a model tested on the entire training data (tr) (this includes the internal test set), the internal test set only (int), or the external test set (ext).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2} \tag{1}$$

$$R^2 = 1 - \sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2 \Big/ \sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - \bar{y}^{\mathrm{obs}}\right)^2 \tag{2}$$

$$\mathrm{bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right) \tag{3}$$

In the prediction of the external test set, an outlier is defined as any point with a residual greater than 4 standard deviations from the mean.

Models
We used the WAAC algorithm to simultaneously optimise the chosen features and number of components in a Partial Least Squares (PLS) model. The plsr method in the pls package in R [31] was used to build the PLS model. Scaling was set to true. A range of 20 allowed parameter values for the number of components in the model was initially set to cover from 1 to 191 inclusive in steps of 10. After each winnowing, the step size was reset so that the maximum value for the number of components was less than the number of remaining descriptors. For the WAAC algorithm itself, a colony of 50 ants was used, and the algorithm was run for 800 iterations with winnowing every 100 iterations. For comparison, the algorithm was run for the same length without any winnowing.

In addition, we used the WAAC algorithm to optimise a Support Vector Machine (SVM) model. The svm method in the e1071 package in R [31] was used to perform ε-regression with a radial basis function. A range of allowed parameter values for the SVM was chosen based on a preliminary run: values for C from 1 to 31 inclusive in steps of 2, and values of ε from 0.01 to 1.61 inclusive in steps of 0.1. Since two parameters needed to be optimised for this model, the length of each optimisation phase in the WAAC algorithm was extended to 150 iterations and the algorithm was run for 1500 iterations in total.

To compare to other feature selection methods, we used the training data to build a Random Forest model [34] using the randomForest package in R (using the default settings of mtry = N/3, ntree = 500, nodesize = 5). We also compared to the best of thirteen k Nearest Neighbours (kNN) models trained on the training set, where k was 1, 5, 10 or 15. For the models based on multiple neighbours, separate models were created where the predictions were combined using exponential, geometric, arithmetic, or inverse distance weighting (for more details, see Nigsch et



al. [33]). The best performing model, as measured by leave-one-out cross validation on the training data, was the 15 NN model with exponential weighting. Hereafter, this model is referred to as the kNN model.

Genetic algorithm
For comparison with the WAAC algorithm, a genetic algorithm for feature selection was implemented in the R statistical programming environment [31]. Fifty chromosomes were randomly initialised so that each chromosome on average corresponded to a model based on half of the descriptors. A selection operator chose 10 chromosomes using tournament selection with tournaments of size 3. Once selected, that chromosome was removed from the pool for further selection. A crossover operator was applied to the selected chromosomes, as a single-point crossover between randomly selected (with replacement) chromosomes yielding a pair of children in each case. Each child was subject to a mutation operator which, for a given bit on a chromosome, had a probability of 0.04 of flipping it. The process of crossover and mutation was repeated until 50 offspring were created. The next generation was then formed by the 25 best chromosomes in the original population along with the best 25 of the offspring.

Results
The WAAC algorithm was used to search parameter and feature space for a predictive model for the Karthikeyan dataset, for both a PLS model and an SVM

Figure 2: Value of the objective function for the best model at each iteration of the WAAC algorithm for the PLS model (top) and the SVM model (bottom). The figures on the right, (b) and (d), show the effect of having a single optimisation phase without any winnowing. Ten repetitions of the algorithm are shown, with corresponding repetitions starting from the same initial random seed.



model. Figures 2(a) and 2(c) show the progress of the algorithm for the PLS and SVM models respectively, as measured by the value of the objective function for the best model found so far in a particular optimisation phase. Each experiment was performed 10 times with different random seeds. For each repetition, the model with the lowest value of the objective function was chosen from among the best models found in each optimisation phase. Of these ten models, the one with the fewest descriptors was chosen as the single final model. This reduces the possibility of finding by chance a model which had an optimal value of the objective function but poor predictive ability.

The selected models for WAAC/PLS and WAAC/SVM are shown in Table 1. Of the 203 original descriptors, only 68 were selected for the PLS model, and 28 for the SVM model. The final models were evaluated by training on the entire training data of 2746 molecules, and predicting the melting point value of the external test set. The results are shown in Figure 3 and summarised in Table 2. The summary statistics for the PLS model are: for the training set, RMSE(tr) = 44.4°C, R²(tr) = 0.52, bias = -0.0°C; for the test set, RMSE(ext) = 46.6°C, R²(ext) = 0.51, bias = -0.74°C. For comparison, the value of the objective function RMSE(int) was 42.8°C. There was a single outlier, mol4161 (Figure 4). The summary statistics for the SVM model are: for the training set, RMSE(tr) = 30.7°C, R²(tr) = 0.77, bias = -1.6°C; for the test set, RMSE(ext) = 45.1°C, R²(ext) = 0.54, bias = -2.1°C. The value of the objective function RMSE(int) was 40.2°C. Three molecules were identified as outliers to the model: mol41, mol4161 and mol4195. These are drawn as filled circles in Figure 3, and their structures are shown in Figure 4.

Figure 3: Performance of models developed with WAAC: (a) a PLS model and (b) an SVM model. The first two columns contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals plots. All values in °C.
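The three summary statistics reported above (RMSE, R² and bias) and the outlier criterion can be computed as in the following Python sketch. It follows the definitions in the text (residual taken as observed minus predicted); measuring the four standard deviations about the mean residual, and the use of the sample standard deviation, are assumptions. The function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error of the predictions."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(obs, pred)))


def r_squared(obs: Sequence[float], pred: Sequence[float]) -> float:
    """1 minus the ratio of residual to total sum of squares."""
    obs_mean = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def bias(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean signed residual (observed minus predicted)."""
    return mean(o - p for o, p in zip(obs, pred))


def outliers(obs: Sequence[float], pred: Sequence[float], n_sd: float = 4.0) -> List[int]:
    """Indices whose residual lies more than n_sd standard deviations from the
    mean residual (assumption: sample standard deviation of the residuals)."""
    residuals = [o - p for o, p in zip(obs, pred)]
    centre, spread = mean(residuals), stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - centre) > n_sd * spread]


# Small made-up example (values in °C); not data from the paper.
obs = [100.0, 150.0, 200.0]
pred = [110.0, 140.0, 200.0]
print(rmse(obs, pred), r_squared(obs, pred), bias(obs, pred), outliers(obs, pred))
```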



Figure 4: Structures of outliers for the models discussed in the text. An outlier is defined as any molecule with a residual greater than four standard deviations from the mean. Molecules 41, 4161 and 4195 are outliers for the WAAC/SVM model; molecules 4161 and 4208 are outliers for both the RF and kNN models; molecule 4161 is the single outlier to the WAAC/PLS model.

For the PLS model the optimised number of components was 49. In order to assess whether the WAAC algorithm sufficiently explored parameter space, we carried out a parameter scan across all allowed values for the parameter with the feature selection found in the best model, and calculated the value of the objective function, RMSE(int). As shown in Figure 5 (solid line), the value of the objective function obtained with 49 components is almost at the minimum, although three larger values for the number of components give slightly better models (42.78°C RMSE(int) versus 42.73°C). For the SVM model, the optimised parameter values associated with the selected model were a cost value of 5, and a value for ε of 0.21. When we carried out a parameter scan across all allowed values of the cost and ε (272 models in total), only one scored better than the best model found by WAAC, and even then, only marginally: 40.22°C RMSE(int) for cost = 5 and ε = 0.11, versus 40.23°C for the best model.

Figure 2(a) shows the value of the objective function for the best PLS model at each iteration for the WAAC algorithm compared to a single optimisation phase without any winnowing, Figure 2(b). The same random seeds are used for corresponding repetitions of the experiments, to ensure that the effect observed is not due to different initial models. In the absence of winnowing, premature convergence occurs and poorer solutions are found. This is also the case for the best SVM model shown in Figure 2(c) and 2(d).

The Random Forest (RF) and kNN models for the same data are shown in Figure 6 and Table 2. Although performance on the training set does not give any indication of predictive ability, it is interesting to note how the different models have completely different RMSE(tr) and R²(tr). Performance on the external test set, which was not used to derive any of the models, allows us to assess predictive ability. On the basis of RMSE(ext), the RF model (44.5°C) is as good as, or slightly better than, the WAAC/SVM model (45.1°C), followed by the WAAC/PLS model (46.6°C) and then the kNN model (48.3°C). A similar order of predictive ability is shown by R²(ext), (RF: 0.55, WAAC/SVM: 0.54, WAAC/PLS: 0.51, kNN: 0.47). The bias shows a slightly different order for the two WAAC-derived models (RF: -0.4°C, WAAC/PLS: -0.7°C, WAAC/SVM: -2.1°C, kNN: -4.1°C).

However, looking at the test set predictions in the second column of Figures 3 and 6 it is clear, particularly for the RF model, that a systematic error occurs at the extremes of the melting point values in the dataset: low values are systematically overpredicted, while high values are underpredicted. In order to quantify the extent of this problem, we plotted the test set residuals versus the experimental melting point, and used linear regression to find the line of best fit (shown in the third column in Figures 3 and 6). For a model without this type of predictive bias, the expected slope is 0. The WAAC/SVM model performs best with a slope of -0.43, followed by the kNN and WAAC/PLS models which both have slopes of -0.49, while the RF



Table 1: Description of the best models found by the WAAC algorithm

WAAC/PLS
  Number of descriptors: 68
  2D descriptors: petitjean, weinerPath, weinerPol, a_ICM, b_1rotR, chi0_C, chi1, reactive, a_heavy, a_nH, a_nF, a_nO, a_nS, VadjEq, VadjMa, balabanJ, PEOE_RPC+, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5, PEOE_VSA+6, PEOE_VSA-1, PEOE_VSA-4, PEOE_VSA_FPNEG, PEOE_VSA_PPOS, PC+, PC-, Q_PC+, Q_RPC+, Q_VSA_FHYD, Q_VSA_FNEG, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS, Q_VSA_FPPOS, Q_VSA_PNEG, Q_VSA_PPOS, Kier1, Kier3, KierA1, KierA2, apol, vsa_acc, SlogP_VSA3, SlogP_VSA5, SMR_VSA3, SMR_VSA5, TPSA
  3D descriptors: AM1_dipole, AM1_Eele, E_sol, E_strain, E_tor, MNDO_HF, MNDO_dipole, MNDO_E, dipole, PM3_HF, ASA-, ASA_H, CASA-, FASA_H, FASA_P, VSA, glob, std_dim1, std_dim3, vol
  Parameters: components = 49

WAAC/SVM
  Number of descriptors: 28
  2D descriptors: radius, weinerPol, b_1rotR, b_rotR, chi1v_c, a_nO, a_nP, balabanJ, PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA-1, PEOE_VSA-5, PEOE_VSA-6, Q_RPC+, SlogP_VSA1, SlogP_VSA4, SlogP_VSA9, SMR_VSA2, SMR_VSA4, SMR_VSA6, TPSA
  3D descriptors: E_oop, E_strain, E_vdw, PM3_LUMO, FASA_P, FCASA+, rgyr
  Parameters: Cost = 5, ε = 0.21

model has a slope of -0.53. The standard errors of all of these values are 0.01.

Another effect of this systematic error is that the predicted values are bunched closer around the mean than the experimental values. The mean and standard deviation of the experimental values in the test set are 167.3°C and 66.4°C, respectively. All of the model predictions have a similar mean: 166.5, 165.2, 167.0 and 163.2°C for the WAAC/PLS, WAAC/SVM, RF and kNN models respectively. However, for the RF model the standard deviation of the predicted values is much smaller than that of the other models: 47.1, 51.6, 41.0 and 49.5°C for the WAAC/PLS, WAAC/SVM, RF and kNN models respectively.

Another widely used stochastic method for feature selection is a genetic algorithm (GA). Hasegawa et al. [35] were one of the first to use a GA in combination with a PLS

Table 2: Summary statistics for the models discussed in the text

                            WAAC/PLS  WAAC/SVM  SVM     kNN     Random Forest
Training set
  RMSE (°C)                 44.4      30.7      36.2    47.6    17.8 (44.7)*
  R²                        0.52      0.77      0.68    0.44    0.92 (0.51)*
  bias (°C)                 0.0       -1.6      -2.3    -3.4    0.0
Test set
  RMSE (°C)                 46.6      45.1      43.9    48.3    44.5
  R²                        0.51      0.54      0.56    0.47    0.55
  bias (°C)                 -0.7      -2.1      -2.3    -4.1    -0.4
  mean (°C)                 166.5     165.2     165.0   163.2   167.0
  standard deviation (°C)   47.1      51.6      49.3    49.5    41.0
Line of best fit through test set residuals
  Slope                     -0.49     -0.43     -0.44   -0.49   -0.53

* Out-of-bag estimates for RMSE and R² are shown in parentheses.
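The slope of the line of best fit through the test set residuals, reported in the last row of Table 2, is an ordinary least-squares slope of residual against experimental value. The Python sketch below assumes the residual is taken as predicted minus experimental, which is the sign convention that gives a negative slope when low values are overpredicted and high values underpredicted; the paper does not state the convention explicitly, and the function name is hypothetical.

```python
from typing import Sequence


def residual_slope(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Least-squares slope of (predicted - observed) regressed on observed.

    A slope of 0 means no systematic error at the extremes; a negative slope
    means predictions are pulled towards the mean.
    """
    residuals = [p - o for o, p in zip(observed, predicted)]
    x_mean = sum(observed) / len(observed)
    r_mean = sum(residuals) / len(residuals)
    sxy = sum((x - x_mean) * (r - r_mean) for x, r in zip(observed, residuals))
    sxx = sum((x - x_mean) ** 2 for x in observed)
    return sxy / sxx


# Made-up example: predictions shrunk halfway towards the mean of 150.
observed = [50.0, 150.0, 250.0]
predicted = [100.0, 150.0, 200.0]
slope = residual_slope(observed, predicted)
print(slope)  # -0.5
```

In this toy case the predictions have half the spread of the observations, which illustrates why a strongly negative slope goes together with a reduced standard deviation of the predicted values.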



model to perform feature selection. The performance of the GA for feature selection is shown in Figure 7 compared to the WAAC algorithm. For both algorithms, the number of PLS components was fixed at 49. Convergence is much slower for the GA algorithm. In addition, the model with the fewest number of descriptors from 10 repetitions of each algorithm had 95 descriptors in the case of GA/PLS (objective function of 42.6°C) but only 57 for WAAC/PLS (objective function value of 42.3°C).

Figure 5: The effect of the number of components on the predictive ability of a PLS model. The red dashed line is a model based on all of the features, whereas the model represented by the blue solid line is based only on the subset selected by the WAAC algorithm. The best subset line ends at 59 components, as there are only 59 features in this subset. The line for all features is truncated at 174 components as the RMSE rapidly increases after this point.

Discussion
The development of the WAAC algorithm arose from an attempt to overcome the limitations of the modified ACO and ANTSELECT algorithms. Both of these algorithms determine probabilities by summing weights based on fitness scores. However, we observed that as convergence is achieved the fitness scores of the ant models in a particular iteration differ very little from each other. Thus, WAAC uses the fraction of the number of ants that have chosen a particular descriptor rather than a function of the fitness of the ants that have chosen that feature. Another problem with the use of weights is that they increase monotonically over the course of the algorithm whereas the sum of the number of ants has a clear bound. In addition, WAAC uses a value for ρ of 1, that is, complete evaporation. Values less than 1 were found to delay convergence without any corresponding improvement in the result. This makes sense when we consider that the evaporation parameter is supposed to help strike a balance between exploitation of information on previous models (global search) and exploration of local feature space (local search). However, this aspect is already included in Shen et al.'s algorithm and WAAC by the influence of the best models (global search) and current models (local search) on the moving probabilities. As a result of this simpler approach, the moving probabilities now have a meaningful interpretation: the probability of choosing a particular descriptor in the next iteration is equal to the average of the fraction of ants that have chosen that descriptor in their current model and the fraction of ants that have chosen it in their best model.

Since the WAAC algorithm requires a range of allowed parameter values for the model, it is generally worthwhile to do an exploratory run of the algorithm to determine reasonable values. In addition, it is important that the number of allowed values for each parameter is less than the number of ants (preferably much less) to ensure that the parameter space is adequately sampled. An appropriate size for the ant population depends on the number of descriptors and the extent of the interaction between them. Model space will be better sampled if more ants are used, but the calculation time will also increase. However, since the feature-selection space is of size 2^n - 1, where n is the number of descriptors, the exact number of ants is not expected to affect the ability of the algorithm to find solutions. An ant population of between 50 and 100 ants is recommended. For the WAAC/PLS study, the relationship between the population size and the best value of the objective function is shown in Figure 8; there is little improvement beyond 50 ants. The length of the optimisation phase should be sufficient to allow the objective function to start to converge to an optimum value. It is not necessary to allow the optimisation phase to proceed much further, as after this point the descriptors chosen in the best models reinforce themselves and broad sampling of the search space no longer occurs. The winnowing procedure and subsequent reinitialisation on a smaller search space is a more effective way of finding the optimum model.

In the past, the development and comparison of feature selection methods for QSAR have involved the use of a standard dataset first reported in 1990, the Selwood dataset [36] of the activity of 31 antifilarial antimycin analogues, whose structures are represented by 53 calculated physicochemical descriptors. However, comparisons between different algorithms have been hampered by the fact that many of the descriptors are highly-correlated, and in addition, a true test using an external test set is not feasible due to the small number of samples. Advances in computing power mean that it is no longer appropriate to use such a small dataset for the purposes of testing feature



selection algorithms. The Karthikeyan dataset used here is much more representative of the feature selection problems that occur in modern QSAR and QSPR studies.

Figure 6: Performance of (a) a kNN model, and (b) a Random Forest model. The first two columns contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals columns. All values in °C.

PLS models are prone to overfitting. Figure 5 shows a comparison between a PLS model that uses the best subset (as selected by WAAC) and one using all of the descriptors. It is clear that the development of a predictive PLS model requires a variable selection step. Even if the number of components is optimised, performance is significantly poorer if all features are used instead of just the subset selected by the WAAC algorithm. It is also worth noting that PLS is a linear method, whereas SVM is a nonlinear method. If the underlying link between descriptor values and the melting point cannot be adequately described by a linear combination of descriptor values, then the performance of the PLS method is likely to suffer. This may explain why, despite containing fewer than half the number of descriptors, the SVM model performed better than the PLS model.

Although the WAAC algorithm is capable of simultaneously optimising the feature selection as well as the parameter values, in some instances it may be preferable to use the WAAC algorithm simply for feature selection and optimise the parameter values separately for each model. This will only be computationally feasible where the model has a small number of parameters which need to be optimised and where the parameter optimisation can be efficiently carried out. For example, the optimal number of components for a PLS model could be determined by internal cross validation. When compared to



the use of a genetic algorithm for optimising the feature selection of a PLS model, the WAAC algorithm performs well, both in terms of faster convergence and in its ability to produce models with fewer descriptors. It should be noted, however, that genetic algorithms have many different implementations as well as several parameters. This result, on a single dataset, cannot therefore be seen as conclusive.

Figure 7: Value of the objective function for the best PLS model at each iteration of (a) a genetic algorithm and (b) the WAAC algorithm. Ten repetitions of each algorithm are shown. The number of PLS components was set to 49.

In comparison to PLS models, the inclusion of a large number of descriptors does not necessarily lead to overfitting for SVM models. Although both Guyon et al. [14] and Fröhlich et al. [15], for example, have developed descriptor selection methods for SVM, an SVM model built on the entire set of descriptors and using the optimised parameters from the WAAC algorithm actually performs slightly better on the external test set. Here, the main effect of the WAAC algorithm is the identification of a minimum subset of descriptors which are the most important for the development of a predictive model. Such a procedure is especially useful when the descriptor values are derived from experimental measurement or require expensive calculation (for example, those derived from QM calculations). It also aids interpretability of the results.

Figure 8: Relationship between the population size and the minimum value of the objective function for the WAAC/PLS model. The value of the objective function is the minimum found from ten repetitions of the algorithm.

Of the 28 descriptors selected by the WAAC/SVM model, three-quarters are 2D descriptors. Of these, many involve the area of the van der Waals surface associated with particular property values. For example the PEOE_VSA+2 descriptor is the van der Waals surface area (VSA) associated with PEOE (Partial Equalisation of Orbital Electronegativity) charges in the range 0.10 to 0.15. Also selected

Page 12 of 15


were descriptors relating to hydrophobic patches on the mol4195, m.p. 342°C, but predicted 111°C. Both of these
VSA (SlogP_VSA1, for example), the contribution to molecules have extended conjugated structures, causing
molar refractivity (SMR_VSA2, for example) which is the molecule to be planar over a wide area, and which are
related to polarisability, and the polar surface area (TPSA). likely to give rise to extensive π-π stacking in the solid
Since the intermolecular interactions in a crystal lattice are state. As a result, they are conformationally less flexible
dependent on complementarity between the properties of than might be expected from the number of rotatable
the VSA of adjacent molecules, the selection of these bonds. mol4161 is also an outlier to the other three mod-
descriptors seems reasonable. Two descriptors were els; for WAAC/PLS it is the only outlier, whereas the RF
selected relating to the number of rotatable bonds and kNN predictions have a second outlier, mol4208 (Fig-
(b_1rotR and b_rotR). These properties are related to the ure 4).
melting point through their effect on the change in
entropy (ΔSfus) associated with the transformation to the The WAAC algorithm described here is particularly useful
solid state. Hydrogen bonds make an important energetic when a machine learning method is prone to overfitting if
contribution to the formation of the crystal structure. This presented with a large number of descriptors, such as is
probably explains the selection of the descriptor for the the case with PLS. However, not all machine learning
number of oxygen atoms (a_nO), although strangely the methods require a prior feature selection procedure. The
number of nitrogen atoms is not included (it was however Random Forest (RF) method of Breiman uses consensus
included in five out of the ten ant models). Four descrip- prediction of multiple decision trees built with subsets of
tors were selected by all ten ant models: b_1rotR, the data and descriptors to avoid overfitting. For compar-
SlogP_VSA1, PEOE_VSA-6 and balabanJ. Balaban's J ison with the WAAC results, we predicted the melting
index is a topological index that increases in value as a point values for the external test data using an RF model
molecule becomes more branched [37]. It seems possible built on the training data. We also compared to a 15 Near-
that increased branching makes packing more difficult, est Neighbour model (kNN) where the predictions of the
and leads to lower melting points. set of neighbours were combined using an exponential
weighting. In our comparison, the RMSE(ext) and R2(ext)
The WAAC algorithm appears to be robust to the presence show that the RF and WAAC/SVM models are very similar,
of highly correlated descriptors. Despite the fact that such and are better than the WAAC/PLS and kNN models.
descriptors were not filtered from the dataset, the selected However, analysis of the residuals shows that the RF is
WAAV/SVM model contains only two pairs of descriptors more prone to bias at high and low values of the melting
with an absolute Pearson correlation coefficient greater point compared to the other models.
than 0.8: b_rotR/b_1rotR (0.97) and SMR_VSA2/
PEOE_VSA-5 (0.81). If the WAAC algorithm were unable A predictive bias was observed for all models at the
to filter highly correlated descriptors, we would expect to extremes of the range of melting points. A similar effect
see many more correlations as 16 of the chosen descrip- was observed by Nigsch et al. for a kNN model of melting
tors were highly correlated (absolute value greater than point prediction [33]. The effect was attributed to the fact
0.8) with at least one descriptor not included in the final that the density of points in the training set is less at the
model. For example, radius has a correlation of 0.86 with extremes of the range of melting point values. This means
respect to diameter (not unexpectedly). weinerPol is that the nearest neighbours to a point near the extreme are
highly correlated with 35 other descriptors, none of which more likely to have melting points closer to the mean.
were chosen in the final model. PM3_LUMO is correlated This effect is most pronounced for the RF model, and the
with both AM1_LUMO (0.97) and MNDO_LUMO explanation may be similar.
(0.96), but neither of other two appear.
In this study the WAAC algorithm was guided using the
For a small number of molecules, our models make very RMSE of prediction for an internal test set, RMSE(int). The
poor predictions. This may either be due to a lack of suffi- choice of which objective function to use should be con-
cient training molecules with particular characteristics, or sidered carefully. If an objective function is chosen which
it may be due to a fundamental deficiency in the informa- does not explicitly penalise the number of descriptors but
tion used to build the models. For example, for the only does so implicitly (for example, RMSE(int)), irrele-
WAAC/SVM models, three outliers can be detected whose vant descriptors may accumulate in the converged model.
residuals are more than four standard deviations from the When using such an objective function, the winnowing
mean (Figures 3 and 4). A polyfluorinated amide, mol41, procedure implemented in WAAC plays an important role
is predicted to have a melting point of 233°C although its in removing these descriptors after the optimisation phase
experimental melting point is 44°C. The melting points of by initiating a new search of a reduced feature space which
the other two outliers were both underestimated: makes it less likely that irrelevant descriptors will be
mol4161, m.p. 314.5°C but predicted 119°C, and selected. This effect is shown in Figure 2(b) and 2(d),



where poorer models were found when the WAAC feature selection and parameter optimisation
procedure was applied without winnowing.

An alternative type of objective function is one that explicitly penalises the number of
descriptors. Such functions typically contain a cost term which is adjusted based on some a
priori knowledge of the number of descriptors desired in the model. For example, the modified
ACO algorithm of Shen et al. [26] was guided by a fitness function with two terms, one
relating to the number of descriptors and the other to the fit of the model to the training
set. Objective functions such as this quickly force models into a reduced feature space by
favouring models with fewer descriptors. However, the moving probabilities used to choose
descriptors will be misleading as they will largely be based on those descriptors present in
models with fewer descriptors rather than those with the best predictive ability. As a result,
descriptors with good predictive ability may be removed by chance. It should be noted that an
objective function that simply optimises a measure of fit to the training data is not a
suitable choice for the development of a model with predictive ability. Optimising the RMSE on
the entire training data, RMSE(tr), or optimising the R2(tr) value, will produce an overfitted
model that fits the training data exceptionally well but performs poorly on unseen data.

Near the end of each optimisation phase, the majority of ants converge to the same feature
selection and parameter values, causing the same model to be repeatedly evaluated. It should
be possible to gain a significant speedup if instead of re-evaluating a model, a cached value
were used. Caching could be simply done by storing the objective function and models for all
of the ants from the last few iterations. This is especially important if an objective
function is used whose value varies on re-evaluation as is the case, for example, with the
RMSE from n-fold cross-validation, RMSE(cv). Since for each ant the best score is retained,
the value of the objective function will tend towards the optimistic tail of the distribution
of values of the RMSE(cv). However, it should not have a major effect on the results of the
feature selection and parameter optimisation, as model re-evaluation generally occurs only
once the majority of the ants' models have already converged.

Conclusion
The key elements to developing an effective QSPR model for prediction are accurate data,
relevant descriptors and an appropriate model. Where there is no a priori information
available on relevant descriptors, some form of feature selection needs to be performed.

We have presented WAAC, an extension of the modified ACO algorithm of Shen et al. [26], which
can perform simultaneous optimisation of feature selection and model parameters. In addition,
the moving probabilities used by the algorithm are easily interpreted in terms of the best and
current models of the ants, and our winnowing procedure promotes the removal of irrelevant
descriptors.

We have shown that the WAAC algorithm can be used to simultaneously optimise parameter values
and the selected features for PLS and SVM models for melting point prediction. In particular,
the resulting SVM model based on 28 descriptors performed as well as a Random Forest model
that used the entire set of 203 descriptors.

Authors' contributions
NMOB conceived and developed the WAAC algorithm, applied it to the melting point dataset,
analysed the results and drafted the manuscript. DSP was involved in the interpretation of the
results, revising the manuscript and carried out the Random Forest calculations. FN
implemented the kNN model. JBOM contributed to the analysis of data and revising the
manuscript. All authors read and approved the final manuscript.

Additional material

Additional file 1
The external test set. The models were evaluated by testing on this external test set.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S1.csv]

Additional file 2
The internal training set. The WAAC feature selection algorithm was trained on this.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S2.csv]

Additional file 3
The internal test set. The objective function used to guide the WAAC feature selection
algorithm was calculated using this internal test set.
153X-2-21-S3.csv]

Acknowledgements
We thank the BBSRC (NMOB and JBOM – grant BB/C51320X/1), Pfizer (DSP and JBOM – through the
Pfizer Institute for Pharmaceutical Materials Science), and Unilever for funding FN and JBOM
and for supporting the Centre for Molecular Science Informatics. NMOB thanks Dr. Jen Ryder,
Dr. Daniel Almonacid and Dr. Avril Coghlan for helpful comments on the manuscript.
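
The caching idea raised in the discussion (reusing a stored objective value when several ants converge on the same model) can be illustrated with a minimal sketch. This is not part of the WAAC implementation: the names makeCachedObjective and toyObjective and the parameter C are hypothetical, the toy objective merely stands in for a measure such as RMSE(int), and the cache here is unbounded rather than limited to the last few iterations.

```javascript
// Hypothetical helper: wraps an expensive objective function with a cache
// keyed on the selected descriptors and the model parameters.
function makeCachedObjective(objective) {
  const cache = new Map();
  let evaluations = 0;

  function keyFor(features, params) {
    // Sort copies so the key does not depend on the order of selection.
    const featureKey = [...features].sort().join(",");
    const paramKey = Object.keys(params)
      .sort()
      .map((name) => `${name}=${params[name]}`)
      .join(",");
    return `${featureKey}|${paramKey}`;
  }

  function cached(features, params) {
    const key = keyFor(features, params);
    if (!cache.has(key)) {
      evaluations += 1; // only count real (uncached) evaluations
      cache.set(key, objective(features, params));
    }
    return cache.get(key);
  }

  cached.evaluations = () => evaluations;
  return cached;
}

// Toy objective (an assumption for illustration, not a real model fit).
const toyObjective = (features, params) => features.length + params.C;

const score = makeCachedObjective(toyObjective);
const first = score(["balabanJ", "b_1rotR"], { C: 1 });
const second = score(["b_1rotR", "balabanJ"], { C: 1 }); // same model, served from cache
```

With a stochastic objective such as RMSE(cv), a cache of this kind would also fix the score at its first evaluated value instead of letting repeated evaluation drift towards the optimistic tail.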



References
1. Hansch C, Maloney PP, Fujita T, Muir RM: Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 1962, 194:178-180.
2. Hawkins DM: The problem of overfitting. J Chem Inf Comput Sci 2004, 44:1-12.
3. Guyon I, Elisseeff A: An introduction to variable and feature selection. J Mach Learn Res 2003, 3:1157-1182.
4. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43:493-500.
5. MOE (Molecular Operating Environment), v2004.03 [http://www.chemcomp.com]. Chemical Computing Group Inc., Montreal, Quebec, Canada
6. SYBYL 7.1 2006 [http://www.tripos.com]. Tripos Inc., 1699 Hanley Road, St. Louis, MO 63144
7. John GH, Kohavi R, Pfleger K: Irrelevant features and the subset selection problem. In Machine learning, Proceedings of the Eleventh International Conference: 10–13 July 1994; Amherst Edited by: Cohen WW, Hirsh H. Morgan Kaufmann; 1994:121-129.
8. Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell 1997, 97:273-324.
9. Dudek AZ, Arodz T, Gálvez J: Computational methods in developing quantitative structure-activity relationships (QSAR): a review. Comb Chem High Through Screen 2006, 9:213-228.
10. Liu Y: A comparative study on feature selection methods for drug discovery. J Chem Inf Comput Sci 2004, 44:1823-1828.
11. Whitney AW: A direct method of nonparametric measurement selection. IEEE Trans Comput 1971, 20:1100-1103.
12. Marill T, Green DM: On the effectiveness of receptors in recognition systems. IEEE Trans Inform Theory 1963, 9:11-17.
13. Pudil P, Novovičová J, Kittler J: Floating search methods in feature selection. Patt Recog Lett 1994, 15:1119-1125.
14. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn 2002, 46:389-422.
15. Fröhlich H, Wegner JK, Zell A: Towards optimal descriptor subset selection with support vector machines in classification and regression. QSAR Comb Sci 2004, 23:311-318.
16. Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning Boston: Kluwer Academic Publishers; 1989.
17. Rogers D, Hopfinger AJ: Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci 1994, 34:854-866.
18. Wegner JK, Zell A: Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J Chem Inf Comput Sci 2003, 43:1077-1084.
19. von Homeyer A: Evolutionary Algorithms and their Applications in Chemistry. In Handbook of Chemoinformatics Volume 3. Edited by: Gasteiger J. Weinheim: Wiley-VCH; 2003:1239-1280.
20. Agrafiotis DK, Cedeno W: Feature selection for structure-activity correlation using binary particle swarms. J Med Chem 2002, 45:1098-1107.
21. Lin WQ, Jiang JH, Shen Q, Shen GL, Yu RQ: Optimized block-wise variable combination by particle swarm optimization for partial least squares modeling in quantitative structure-activity relationship studies. J Chem Inf Model 2005, 45:486-493.
22. Guha R, Jurs PC: Development of linear, ensemble and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci 2004, 44:2179-2189.
23. Vapnik VN: The nature of statistical learning theory New York: Springer Verlag; 1995.
24. Hastie T, Tibshirani R, Friedman J: The elements of statistical learning: data mining, inference, and prediction New York: Springer; 2001.
25. Smola AJ, Schölkopf B: A tutorial on support vector regression. Stat Comput 2004, 14:199-222.
26. Shen Q, Jiang JH, Tao JC, Shen GL, Yu RQ: Modified Ant Colony Optimization Algorithm for Variable Selection in QSAR Modeling: QSAR Studies of Cyclooxygenase Inhibitors. J Chem Inf Model 2005, 45:1024-1029.
27. Goss S, Aron S, Deneubourg JL, Pasteels JM: Self-organized shortcuts in the Argentine ant. Naturwissenschaften 1989, 76:579-581.
28. Dorigo M, Di Caro G, Gambardella LM: Ant algorithms for discrete optimization. Artif Life 1999, 5:137-172.
29. Izrailev S, Agrafiotis DK: Variable selection for QSAR by artificial ant colony systems. SAR QSAR Environ Res 2002, 13:417-423.
30. Gunturi SB, Narayanan R, Khandelwal A: In silico ADME modelling 2: Computational models to predict human serum albumin binding affinity using ant colony systems. Bioinorg Med Chem 2006, 14:4118-4129.
31. R: A Language and Environment for Statistical Computing 2006 [http://www.R-project.org]. R Foundation for Statistical Computing, Vienna, Austria
32. Karthikeyan M, Glen RC, Bender A: General melting point prediction based on a diverse compound data set and artificial neural networks. J Chem Inf Model 2005, 45:581-590.
33. Nigsch F, Bender A, van Buuren B, Tissen J, Nigsch E, Mitchell JBO: Melting point prediction employing k-nearest neighbour algorithms and genetic parameter optimization. J Chem Inf Model 2006, 46:2412-2422.
34. Breiman L: Random Forests. Mach Learn 2001, 45:5-32.
35. Hasegawa K, Miyashita Y, Funatsu K: GA Strategy for Variable Selection in QSAR Studies: GA-Based PLS Analysis of Calcium Channel Antagonists. J Chem Inf Comput Sci 1997, 37:306-310.
36. Selwood DL, Livingstone DJ, Comley JCW, O'Dowd AB, Hudson AT, Jackson P, Jandu KS, Rose VS, Stables JN: Structure-activity relationships of antifilarial antimycin analogues: a multivariate pattern recognition study. J Med Chem 1990, 33:136-142.
37. Balaban AT: Highly discriminating distance-based topological index. Chem Phys Lett 1982, 89:399-404.


BMC Bioinformatics BioMed Central

Userscripts for the Life Sciences
Egon L Willighagen*1, Noel M O'Boyle2, Harini Gopalakrishnan3,
Dazhi Jiao3, Rajarshi Guha3, Christoph Steinbeck4 and David J Wild3

Address: 1Cologne University Bioinformatics Center, Cologne University, Cologne, Germany, 2Cambridge Crystallographic Data Centre,
Cambridge, UK, 3School of Informatics, Indiana University, Bloomington, USA and 4Wilhelm-Schickard-Institut, Center for Bioinformatics,
University of Tübingen, Tübingen, Germany
Email: Egon L Willighagen* - egonw@users.sf.net; Noel M O'Boyle - baoilleach@gmail.com; Harini Gopalakrishnan - hgopalak@indiana.edu;
Dazhi Jiao - djiao@indiana.edu; Rajarshi Guha - rguha@indiana.edu; Christoph Steinbeck - c.steinbeck@steinbeck-molecular.de;
David J Wild - djwild@indiana.edu

Received: 31 August 2007
Accepted: 21 December 2007
Published: 21 December 2007
BMC Bioinformatics 2007, 8:487 doi:10.1186/1471-2105-8-487
This article is available from: http://www.biomedcentral.com/1471-2105/8/487
© 2007 Willighagen et al; licensee BioMed Central Ltd.

Abstract
Background: The web has seen an explosion of chemistry and biology related resources in the
last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available
with a wide variety of types of information. There is a huge need to aggregate and organise this
information. However, the sheer number of resources makes it unrealistic to link them all in a
centralised manner. Instead, search engines to find information in those resources flourish, and
formal languages like Resource Description Framework and Web Ontology Language are
increasingly used to allow linking of resources. A recent development is the use of userscripts to
change the appearance of web pages, by on-the-fly modification of the web content. This opens
possibilities to aggregate information and computational results from different web resources into
the web page of one of those resources.
Results: Several userscripts are presented that enrich biology and chemistry related web
resources by incorporating or linking to other computational or data sources on the web. The
scripts make use of Greasemonkey-like plugins for web browsers and are written in JavaScript.
Information from third-party resources is extracted using open Application Programming
Interfaces, while common Universal Resource Locator schemes are used to make deep links to
related information in that external resource. The userscripts presented here use a variety of
techniques and resources, and show the potential of such scripts.
Conclusion: This paper discusses a number of userscripts that aggregate information from two or
more web resources. Examples are shown that enrich web pages with information from other
resources, and show how information from web pages can be used to link to, search, and process
information in other resources. Due to the nature of userscripts, scientists are able to select those
scripts they find useful on a daily basis, as the scripts run directly in their own web browser rather
than on the web server. This flexibility allows the scientists to tune the features of web resources
to optimise their productivity.


Background
The web has seen an explosion of chemistry and biology related resources in the last 15 years:
thousands of scientific journals, databases, wikis, blogs, and regular HTML pages are
available containing information relevant to chemists and biologists [1-4]. While each of
those resources is valuable in itself, integrating information from these resources increases
the value even more: for example, PubChem provides a wealth of data but could be complemented
with 3D models to create an even richer information source.

The original goal of the world wide web was to hyperlink individual web pages allowing humans
to explore a web of knowledge. For individual web pages these links can be created manually,
as is still done in blogs, wikis, and static HTML pages; for large databases this is, however,
not feasible. Userscripts are small programs that can alter the HTML content rendered by web
browsers. For example, a userscript may add book prices from competitors to the Amazon.com
website, or may remove unwanted advertisements from a site. Using the same approach,
userscripts can also solve the problem of interlinking web resources, by adding to web pages
of one resource dynamically generated hyperlinks into another. By selecting a specific set of
userscripts, the user can tune a website to provide all kinds of facilities not anticipated by
the original author of the site. For example, userscripts have been used in bioinformatics to
enhance the iHOP web page [5]: the script extracts user assigned tags from a third party
resource, and shows them as a tag cloud on iHOP pages for particular genes.

Automatic hyperlinking is only possible through the use of unique identifiers such as the PDB
ID, the CAS registration number and, more recently, the IUPAC International Chemical
Identifier (InChI). While identifiers are easily used to connect databases, as is done in the
SRS system [6] or in meta database software like BioWarehouse [7], the sheer number of web
resources makes it impossible to integrate all resources. Consequently, (bio)chemical search
engines, such as ChemSpider [8] and tools to harvest information from web resources, such as
ChemXtreme [9] and BioSpider [10,11], as well as systems that standardize algorithmic access
to resources and services, such as BioMOBY [12], have emerged.

Another reason why identifiers do not always allow linking resources is that many of them are
database specific, such as the PDB ID and the Digital Object Identifier (DOI), and sometimes
even restricted in use, as with the CAS registration number. Open standard identifiers address
this problem. Such identifiers can be derived from ontologies, dictionaries or encyclopedias,
or computed by an algorithm. The Gene Ontology terms are often used as identifiers [13],
indicating that a specific database entry is related to the cited term in the ontology, and
therefore related to entries from other databases annotated with that term.

Identifiers that can be calculated algorithmically are even better, because they do not need
to be looked up in a list of identifiers. Instead, anyone can calculate them from the object
itself. For example, for molecular structures the InChI [14] is the ideal replacement for
database specific identifiers such as the CAS registration number, the PubChem compound
identifier and the ChEBI identifier. These all require a look up or conversion table to
convert one identifier into another. Using the InChI, one can look up information in all
databases without having to know the database specific identifier.

In addition to the unique identifier, one additional functionality is needed to create a link
to a particular database: the database must provide either an API (Application Programming
Interface) which can be queried using the identifier or else provide a uniform scheme for deep
linking to a web page containing information about the entry behind the identifier. For
example, looking up structures in PubChem is done with a scheme in which the InChI is embedded
verbatim. To look up the structure of methane (InChI=1/CH4/h1H4), the URL
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pccompound&term=%22InChI=1/CH4/h1H4%22[InChI]
is used.

The plethora of resources is overwhelming, and both users and database developers may prefer
a subset of them, for example the resources they trust most. It is therefore worthwhile to
have a system that allows users to choose which resources they want to have linked with which
other resources. Userscripts provide the necessary technology to allow this within web
browsers. Here we describe several userscripts we have developed to create links between web
resources of interest to researchers in the life sciences.

Implementation
We use the following techniques to link various web resources in this paper: userscripts,
unique identifiers, microformats, and web resource interfaces. The following sections describe
how these are used in this work.

Userscripts
A userscript is a small program written in JavaScript that is automatically run within a web
browser (often by a plugin or add-on) when the user accesses pages that match a particular
URL. Userscripts allow the user to modify the HTML content of a web page on-the-fly, by adding
or removing elements or by moving them around. For example, userscripts exist that remove
pop-up advertisements



from web pages, and that alter the Amazon.com web page to provide book prices from alternative
suppliers. A repository of userscripts exists at userscripts.org. Chemists and biologists can
find relevant userscripts by searching with the terms "chemistry" or "biology".

Of the popular web browsers, only Opera provides built-in support for userscripts (referred to
as 'User JavaScript'). To enable userscript support in other browsers, a third-party extension
needs to be installed: Greasemonkey [15] for Firefox, Creammonkey [16] for Safari, IE7pro [17]
or Turnabout [18] for Internet Explorer. The userscripts presented in the Results section are
targeted at Greasemonkey, although it should be possible to run them in any browser with only
minor changes.

The web browser user has full control over which userscripts she wants to have installed,
allowing her to customise web pages exactly the way she wishes. Once installed, it is possible
to individually enable or disable installed scripts. For example, for Greasemonkey see the
Manage User Scripts option in the Tools menu under Greasemonkey, or to disable the extension
completely, click on the Greasemonkey icon in the status bar. Further control is provided by
specifying to which web pages the script applies. Userscripts define default rules (e.g.
http://www.biomedcentral.com/), but the user is normally able to override these.

The userscript has two main methods to find the HTML content to which to add or remove
elements. The most accurate one is to analyse the document object model (DOM). This approach
is used by the Sechemtic userscript to find uses of chemical microformats (see example below
under Microformats). The other method is to use regular expressions to find certain strings in
the text of the web page. This works particularly well for identifiers with a unique and well
described syntax. For example, a regular expression for InChIs will have fewer false positives
than one for PDB identifiers.

As with any program that you run on your computer, it is important to consider security when
installing userscripts. Although the security model used by Greasemonkey prevents attacks by
malicious websites, it is unable to detect or prevent the user himself installing a malicious
userscript. Such scripts do exist; recently, malicious userscripts were uploaded to
Userscripts.org that attempted to steal information from users' cookies. In that case, once
the problem was discovered the malicious userscripts were easily detected and removed by the
administrator. We recommend that unless you are familiar with JavaScript and carefully inspect
the source code, you should only install userscripts from a trusted source.

Unique identifiers
Recognition of biological and chemistry relevant information on web pages is simplified by
using identifiers [19]. Such identifiers may or may not be marked up with semantic markup such
as microformats (see below). Identifiers are widely used to make connections between
databases, and often identify a specific entry in a database. Some examples of this are the
PDB identifier, Digital Object Identifiers, PubChem compound identifier, and the CAS registry
number for, respectively, the PDB, DOI, PubChem, and the Chemical Abstracts Service databases.
In this study we use DOIs, InChIs, and PDB identifiers as our unique identifiers (Table 1).
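
The remark that a regular expression for InChIs gives fewer false positives than one for PDB identifiers can be illustrated with a small sketch. The three patterns below are simplified assumptions written for this illustration, not the expressions used by the userscripts in this paper, and findIdentifiers and the sample sentence are hypothetical.

```javascript
// Simplified, illustrative patterns (assumptions, not the published ones).
// DOI: "10." followed by a registrant code, a slash and a suffix.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/g;
// PDB ID: four characters, a digit 1-9 followed by three alphanumerics.
const PDB_ID_PATTERN = /\b[1-9][A-Za-z0-9]{3}\b/g;
// InChI: the literal prefix "InChI=1/" (or "InChI=1S/") makes matches unambiguous.
const INCHI_PATTERN = /InChI=1S?\/[^\s"<>]+/g;

// Return every candidate identifier found in a piece of page text.
function findIdentifiers(text) {
  return {
    dois: text.match(DOI_PATTERN) || [],
    pdbIds: text.match(PDB_ID_PATTERN) || [],
    inchis: text.match(INCHI_PATTERN) || [],
  };
}

const sample =
  "The 2007 article doi:10.1186/1471-2105-8-487 mentions PDB entry 1CRN " +
  "and methane InChI=1/CH4/h1H4 in passing.";
const found = findIdentifiers(sample);
// found.pdbIds contains the real entry "1CRN" but also the year "2007" and
// fragments of the DOI, whereas found.inchis contains only the InChI.
```

Because the PDB pattern has no distinctive prefix, a practical script would probably need extra context, such as a nearby "PDB" label, before turning a match into a link.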

Table 1: Userscripts for the life sciences. A summary of the resources and identifiers used by
userscripts for the life sciences. The Identification method indicates how the userscript
recognises relevant information on a web page. The Identifiers column describes the unique
identifier searched for. The Resources column indicates the web resource to which a link is
created, or from which data is extracted.

Technologies and Resources used

Name                      Identification method         Identifiers                 Resources
------------------------  ----------------------------  --------------------------  ------------------------------
Jmol4PubChem              HTML tags on PubChem          PubChem ID                  Pub3D [36]
OSCAR3 on HTML            natural language processing   chemical structure name     -
PDB-Jmol                  regular expression            PDB ID                      First Glance in Jmol [41]
Sechemtic                 microformats                  InChI, SMILES, CAS number   PubChem [32], eMolecules [43],
                                                                                    Google [54]
Add quotes to DOIs        regular expression            DOI                         Postgenomic [3],
                                                                                    Chemical blogspace [4]
Add quotes to molecules   microformats                  InChI                       Chemical blogspace [4]
Add to Connotea           regular expression            DOI                         Connotea [30]



Microformats
Microformats [20] are a lightweight specification that extends HTML to add semantic markup to web
pages. For example, hCard is a microformat that allows semantic markup of address information
[21], and hCalendar is a microformat specification for the representation of calendar information
about events [22].

A microformat specification has also been suggested for chemistry that would make it much easier
to recognise compound names, InChIs, SMILES and CAS registry numbers. Userscripts, or indeed any
other programs, would then no longer need to depend on regular expressions to find names and
identifiers, but could use this markup to accurately extract the identifier.

For example, a web page implementing the InChI microformat would wrap any InChIs in a HTML <span>
element with a @class attribute as follows: <span class="inchi">InChI=1/</span>. This information
can easily be extracted using the document.evaluate method, which takes an XPath [23] expression
(//span[@class="inchi"] in this case):

    allInChIs = document.evaluate(
        '//span[@class="inchi"]', document, null,
        XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
        null
    );

This code returns all HTML nodes that mark up InChI strings using the InChI microformat. By
iterating over these nodes, the userscript can insert new HTML elements, such as links to external
resources, as shown here in code taken from the Sechemtic userscript:

    for (var i = 0; i < allInChIs.snapshotLength; i++) {
        spanElement = allInChIs.snapshotItem(i);
        inchi = spanElement.innerHTML;
        // create a link to PubChem
        newElement = document.createElement('a');
        newElement.href = "http://www.ncbi.nlm." +
            "nih.gov/entrez/query.fcgi?CMD=search" +
            "&DB=pccompound&term=%22" + inchi +
            "%22[InChI]";
        newElement.innerHTML =
            "<sup>PubChem</sup>";
        // place the link directly after the marked-up InChI
        spanElement.parentNode.insertBefore(
            newElement, spanElement.nextSibling
        );
    }

Web resource interfaces
Web databases are the primary source of information used by the discussed userscripts. While it is
easy to have scripts create links to external web resources, it is also possible for them to
retrieve information from those resources and include it in the HTML content of the web page the
user is browsing. The latter is, for example, performed by the userscript that adds comments from
Postgenomic.com and Chemical blogspace to journal web pages.

The general approach userscripts use to retrieve information from external web resources relies on
HTTP, just as the web browser itself does. To simplify the process, userscripts tend to use a
combination of XMLHttpRequest, possibly via the Greasemonkey GM_xmlhttpRequest wrapper method, and
the JavaScript Object Notation (JSON) format [24] for data representation. The XMLHttpRequest
method retrieves the information using a URL that normally points to a data interface, or API. The
Postgenomic.com software has such an API that returns the blog posts that discuss a particular
article, as identified by its DOI. Chemical blogspace uses the same API, and adds another one to
return blog posts that discuss a particular molecule, as identified by its InChI. Both database
APIs can return the information as JSON objects, which is how they are used in the discussed
userscripts.

Since our userscripts rely on a particular API or specially-constructed URL to access an external
resource, they will fail if the external resource changes its API or the URL it provides to access
it. This will not affect the browsing experience of the user, but the additional functionality
provided by the userscript will no longer be available. To deal with this, each of the userscripts
described in this article checks once a day for a new version and prompts the user to install it
if one is available. This means that when a userscript is updated to deal with a new API or URL,
every user will quickly have access to the latest version.
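As a rough illustration of the retrieval step described under Web resource interfaces, the sketch
below builds a request URL for a DOI and converts a JSON reply into an HTML fragment. The endpoint,
the query parameters and the url, title and blog fields are assumptions made for the example, not
the actual Postgenomic or Chemical blogspace API, and the helper names are hypothetical.

```javascript
// Hypothetical endpoint and response shape, for illustration only.
var API_BASE = "https://api.example.org/posts";

// Build the request URL for a given DOI.
function buildApiUrl(doi) {
  return API_BASE + "?format=json&doi=" + encodeURIComponent(doi);
}

// Escape text before placing it in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn the JSON text returned by the API into an HTML list of blog posts.
function renderPosts(jsonText) {
  var posts = JSON.parse(jsonText);
  var items = posts.map(function (post) {
    return '<li><a href="' + escapeHtml(post.url) + '">' +
      escapeHtml(post.title) + "</a> (" + escapeHtml(post.blog) + ")</li>";
  });
  return "<ul>" + items.join("") + "</ul>";
}

// In a Greasemonkey script the pieces might be wired together like this
// (popup is a hypothetical element created by the userscript):
// GM_xmlhttpRequest({
//   method: "GET",
//   url: buildApiUrl(doi),
//   onload: function (response) {
//     popup.innerHTML = renderPosts(response.responseText);
//   }
// });
```

Keeping URL construction and rendering as plain functions means they can be exercised outside the
browser; only the commented GM_xmlhttpRequest call depends on Greasemonkey.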

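The once-a-day update check mentioned above could be organised along the following lines. This is
a minimal sketch under stated assumptions: version numbers are taken to be dotted integers, and
where the timestamp and the latest version number come from (for instance Greasemonkey's
GM_getValue and GM_setValue storage, and a version file on the project site) is left to the
caller. The function names are hypothetical.

```javascript
var ONE_DAY_MS = 24 * 60 * 60 * 1000;

// True if no check has been recorded yet, or at least a day has passed since the last one.
function isUpdateCheckDue(lastCheckMs, nowMs) {
  return !lastCheckMs || nowMs - lastCheckMs >= ONE_DAY_MS;
}

// Compare dotted version strings numerically, so that "1.10" is newer than "1.9".
function isNewerVersion(installed, latest) {
  var a = installed.split(".").map(Number);
  var b = latest.split(".").map(Number);
  for (var i = 0; i < Math.max(a.length, b.length); i++) {
    var x = a[i] || 0;
    var y = b[i] || 0;
    if (y > x) return true;
    if (y < x) return false;
  }
  return false; // identical versions
}
```

A userscript would call isUpdateCheckDue with the stored timestamp and the current time, fetch the
latest version number only when a check is due, and prompt the user when isNewerVersion returns
true.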


Results
This paper introduces userscripts that have been written in our research groups as exemplars of
how web resources can be integrated and to outline how they can be used in research. Our
userscripts can be classified into two broad areas: those that link chemical and biological data
to websites, and those that affect how we interact with the scientific literature.

In the following sections, we describe in detail how functionality is added to the web page being
browsed. Table 1 summarises the resources linked to, or accessed, by each script, as well as the
unique identifier used.

Interacting with the scientific literature
OSCAR3 running on HTML
Published journal articles and other web documents with chemistry content are not normally marked
up by the publishers or authors to provide machine readable representations of chemical structures
and related information. As a result, there has been active interest in methods which can mark
documents up automatically. In particular, OSCAR3 [25,26], developed at the Unilever Centre for
Molecular Informatics at the University of Cambridge, and used by the Royal Society of Chemistry
in their Project Prospect [27], searches documents for chemical names, spectra, and other chemical
information, and automatically marks up the content using XML tags (to the extent of generating,
where possible, machine readable SMILES and InChI structures for chemicals referenced in the
document).

We have created a userscript, ChemGM.user.js, that will automatically run OSCAR on a web page and
provide inline hypertext links to PubChem for chemical structure names that are found in the page
(including 2D structure depictions generated by another web service and PubChem searches). The
userscript can be run on any web page, but it is particularly applicable to online journal
articles and chemistry blogs. An example highlighting the effect of this userscript is shown in
Figure 1. Note that though the images use an article from Chemistry Central Journal, the script
can be applied to any web page, irrespective of its source or content.

Figure 1
Highlighting and annotating chemical terms in an online journal article. Screenshots showing the
effect of the ChemGM.user.js userscript on the Chemistry Central Journal web page (full URL: [47])
for Majumder et al. [48]. (a) When the userscript is running a toolbar is added to the top of
every webpage. Clicking the highlight button in the toolbar causes the contents of the webpage to
be analysed for chemical terms. (b) shows the original text of the abstract. (c) After a minute or
so, any chemical terms recognised are highlighted in yellow, and are annotated with hypertext
links to their entries in PubChem (if available) and a 2D depiction of the image.

Add quotes from Chemical blogspace and Postgenomic to DOIs
It can be a challenge to keep up with the primary literature in a field. At the same time, there
are a large number of scientific blogs, many of which have reviews of the recent literature or
highlight interesting papers. The Postgenomic web site was developed by Euan Adie and later hosted
by Nature Publishing Group and currently aggregates information from over 750 scientific blogs
[3]. The source code is open and has been used by one of the authors (ELW) to establish a similar
site, Chemical blogspace, for over 140 blogs with chemical content [4]. Both of these sites
identify references to journal articles in blogs, and make this information available through an
API. Compared to the Postgenomic website, the Chemical blogspace site also identifies molecules
referenced in blogs either by microformat markup of InChI and SMILES, or by analysing links to
Wikipedia [28]. If the latter link points to a wiki page that contains a PubChem compound
identifier or an InChI, then the molecular structure is linked to the blog post.

This userscript uses the aggregated information collected by Postgenomic and Chemical blogspace.
It runs whenever the user accesses the website of a journal publisher. It identifies any DOIs on
the page, and uses the Chemical blogspace and Postgenomic APIs to find out whether those DOIs have
been referenced in a blog post. If so, an icon is added to the web page next to the DOI which, if
hovered over with the mouse, causes a popup to appear containing the name of the citing blog post,
the blog name, and the first few lines of text of the blog. The full content can be accessed by
clicking on the title of the blog post. In this way, content from blog articles widely dispersed
across the web is brought directly to where it is likely to be of most interest, the journal web
site. Figure 2 shows the effect of this userscript when running on the HTML version of Spjuth et
al. [29].

Providing reviews of journal articles is only one of the uses of such a userscript. It is also a
general way to create a link between the content of a blog post and a particular paper. In this
way, bloggers can use blog posts to enhance the original journal website without any intervention
required by the publisher. For example, the author of a paper may write a blog post which provides
additional supporting information for a journal article or includes the article preprint for those
who do not have a subscription. Alternatively, the author of a paper may write a blog post and
include the DOIs of all of the references. This would not only promote his/her own paper (all of
the cited papers would show a blog comment pointing to the citing paper), but would result in an
eventual network of citations which could be used to measure the impact of a paper.

Figure 2
Adding information to DOIs on journal web pages. Screenshots from the BMC Bioinformatics web page
(full URL: [49]) for Spjuth et al. [29] (a) without any userscript enabled, and (b) showing the
effect of the two userscripts Add quotes from Chemical blogspace and Postgenomic to DOIs and Add
to Connotea. The latter added a Connotea logo (a 'c' surrounded by linking arrows), which links to
the Connotea dialog box for adding this paper to your library, and a number indicating how many
people have already bookmarked this paper, which links to the existing entry for this paper on
Connotea. The Add quotes userscript added the Cb logo, which links to the Chemical blogspace page
for this paper, and a Pg logo, linking to the Postgenomic page. The popup titled Powered by
Postgenomic.com (only partially shown) appears when the mouse is placed on the Pg logo, and
contains quotes from and links to the citing blog articles.

Add to Connotea
Connotea is a social bookmarking site developed by Nature Publishing Group for scientists [30]. It
allows a user to bookmark websites using either the DOI or a URL, and to tag those bookmarks.
Crucially, it also provides an API for retrieving information.

The Add to Connotea userscript has two aspects. Firstly, it makes it easy to add papers to
Connotea from journal webpages, by adding a hyperlink in the form of the Connotea logo next to
every DOI identified on a journal web page. Clicking on the logo brings the user to the Connotea
page for adding new papers. This aspect of the userscript is not entirely novel. A userscript has
previously been developed which allows the user to add papers to Connotea from NCBI PubMed [31].
In addition, a small number of publishers (including BioMed Central and Nature Publishing Group)
provide a facility to add papers to Connotea directly from their website. Our userscript differs
in that it will work on the website of any journal publisher where the text contains DOIs.

The second aspect of this userscript is more interesting in the context of this paper. The
userscript queries the Connotea API to find out how many people have previously added this paper
to their Connotea account. It then adds this number next to the Connotea icon. Clicking on the
number brings you to the Connotea page for that paper. From here it is possible to access comments
on the paper. More useful, perhaps, is the ability to find related papers by looking at the other
papers a particular Connotea user has tagged with the same tag. Figure 2 shows the effect of this
userscript when running on the HTML version of Spjuth et al. [29].

This aspect of the userscript has the potential to affect the way we read the literature. The
number of times a particular paper has been bookmarked on Connotea can be considered a measure of
its importance or its interest. In the past, measures such as the number of citations have served
this purpose, but this information is generally not shown on journal web pages as it is not freely
available. Another effect of this userscript is to link the paper the user is viewing to related
papers through the Connotea website. If a researcher finds that a particular paper has been
bookmarked on Connotea and is of interest to him or her, he or she is likely to find other
relevant papers by browsing through the other papers bookmarked by the same Connotea user with the
same tag.

On a technical note, this userscript illustrates some techniques necessary for accessing an API
that requires a user name and password and that, in addition, only permits one API request every
two seconds or so. Note that this userscript requires the user to have a Connotea account (which
is freely available at Ref. [30]).

Linking to chemical and biological data sources
Enhancement of PubChem with 3D structures
The PubChem repository is a public collection of over 10 million compounds [32]. The database
contains 2D structures as well as a number of precomputed properties (such as number of heavy
atoms and topological polar surface area [33]). The web interface to this database allows a wide
variety of queries. The results are usually represented in the form of a summary web page
containing images of the 2D structures of all the compounds satisfying the query with links to
pages for individual compounds which provide a summary of the properties of the compound. In many
cases it would be useful to be able to view an image of the 3D structure of a molecule. However,
PubChem currently does not contain 3D structures for the compounds stored in the database.

To address this problem, we developed a database of 3D structures of PubChem compounds as part of
our web service infrastructure for chemoinformatics [34]. The structures were generated using a
two-step process in which the SMILES were converted to a set of rough 3D coordinates using
stochastic proximity embedding [35] and subsequently geometry optimised using the MMFF94 force
field, using in-house code. A number of compounds were excluded from the final 3D database since
the force field did not contain parameters for certain atom types. However, the 3D database, known
as Pub3D [36], contains approximately 99% of the compounds in PubChem. Pub3D is wrapped by a set
of web services which encapsulate common queries including finding a structure by compound ID
(CID) or finding structures matching a SMARTS pattern.

Using this web service interface we created a userscript called 3DStructureView.user.js that
allows 3D structures from our database to be shown when users visit the PubChem website (see
Figure 3). The script is designed to work only on the summary and detail pages that a user views
after a PubChem search. It parses the page and identifies the compound ID which is then used to
construct a call to the Pub3D database. The return value is a string containing the 3D structure
of the compound, in SD format, which is used to construct an appropriate URL. The result of this
process is that the user can now click on a link titled 3DView(Jmol), which will cause a Jmol
applet [37-39] window to appear showing the 3D structure of the compound in question.
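The compound ID lookup performed by the 3DStructureView userscript might look roughly like the
following sketch. It assumes the CID can be read from the cid query parameter of a PubChem summary
URL (the form of the aspirin URL cited as [50]); the text only says that the script parses the
page. The Pub3D endpoint and its parameters are hypothetical placeholders, since the real service
address is not given here.

```javascript
// Extract the compound ID from a URL such as ".../summary/summary.cgi?cid=2244".
// Returns null when no cid parameter is present.
function extractCid(url) {
  var match = /[?&]cid=(\d+)/i.exec(url);
  return match ? parseInt(match[1], 10) : null;
}

// Hypothetical Pub3D service endpoint, for illustration only.
var PUB3D_BASE = "https://pub3d.example.org/structure";

// Build the request for the 3D structure of one compound in SD format.
function buildPub3dUrl(cid) {
  return PUB3D_BASE + "?cid=" + cid + "&format=sdf";
}
```

In a userscript, extractCid would be applied to window.location.href, and the SD-format string
returned from the service would then be handed to the Jmol applet.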

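The two DOI-based userscripts described earlier in the Results (Add quotes from Chemical
blogspace and Postgenomic to DOIs, and Add to Connotea) both need to locate DOIs in page text,
and the Connotea script must also space its API requests roughly two seconds apart. The sketch
below shows one way this might be done. The regular expression is a commonly used approximation
rather than the authors' own pattern, and lookup in the closing comment is a hypothetical
callback.

```javascript
// Assumed DOI pattern: "10.", a 4 to 9 digit registrant code, "/", then a suffix.
var DOI_PATTERN = /\b10\.\d{4,9}\/[-._;()\/:A-Za-z0-9]+/g;

// Return the distinct DOIs found in a piece of text, in order of first appearance.
function findDois(text) {
  var matches = text.match(DOI_PATTERN) || [];
  var seen = {};
  var result = [];
  matches.forEach(function (m) {
    // Sentence punctuation directly after a DOI is matched by the suffix class; drop it.
    var doi = m.replace(/[.,;:)]+$/, "");
    if (!seen[doi]) {
      seen[doi] = true;
      result.push(doi);
    }
  });
  return result;
}

// Delays (in ms from now) at which n requests may be sent so that
// consecutive requests are at least intervalMs apart.
function scheduleDelays(n, intervalMs) {
  var delays = [];
  for (var i = 0; i < n; i++) {
    delays.push(i * intervalMs);
  }
  return delays;
}

// Wiring in the browser might then be:
// var dois = findDois(document.body.textContent);
// scheduleDelays(dois.length, 2000).forEach(function (delay, i) {
//   setTimeout(function () { lookup(dois[i]); }, delay);
// });
```

Scheduling all requests up front keeps the throttling logic trivial; a queue that waits for each
response before sending the next would be a stricter alternative.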


Figure 3
Adding 3D models to PubChem. Screenshot of the PubChem web page for aspirin (full URL: [50]) with
the 3DStructureView userscript enabled. The userscript added the first line of text in the
compound summary information. Clicking on the 3DView(Jmol) link causes a window to popup showing
a 3D model of the structure. Clicking on the SDF Format link allows the user to download the
calculated 3D structure of the molecule in SDF file format.

As an example, after installing the script, one can navigate to the PubChem website [32] and
search for entries related to aspirin. This should return slightly more than thirty hits. If one
then clicks on the compound ID for the first hit, one is taken to a summary page which provides
various details regarding the molecular structure and biological activity of aspirin. In addition
to the data provided by PubChem, the userscript has enhanced the page to add two links:
3DView(Jmol) and SDF Format. The former link will bring up an instance of the Jmol applet showing
a 3D structure of aspirin, while the second link allows one to download the 3D structure in the SD
file format (see Figure 3).

PDB-Jmol Greasemonkey Script
The Protein Data Bank [40] is a repository of experimentally-determined 3D coordinates of
proteins. Each entry has a PDB ID, which is a unique four-character identification code consisting
of a number followed by three characters which can be either letters or numbers; for example,
1abe, 114L and 6NN9. The PDB-Jmol Greasemonkey userscript identifies all PDB IDs on web pages and
adds hyperlinks to the FirstGlance in Jmol web page [41] for that protein. This website uses the
Open Source molecular viewer Jmol to show the protein as a 3D model which can be manipulated by
the user. In this way, the user can instantly view the 3D structure of any PDB ID mentioned on a
website, and in particular, if the user is reading the HTML version of a journal article on-line,
all PDB IDs in the paper will similarly be enhanced. Figure 4 shows an example of the latter case
where PDB identifiers in the online version of Mardia et al. [42] have been identified and links
added.

Figure 4
The effect of the PDB-Jmol userscript. Screenshots from the BMC Bioinformatics web page (full URL:
[51]) for Mardia et al. [42] showing a paragraph containing PDB identifiers (a) without the
PDB-Jmol userscript installed, and (b) with the PDB-Jmol userscript installed. The Jmol text in
yellow is a hyperlink to the FirstGlance in Jmol page [41] for a particular protein structure.

As this userscript runs on all web pages accessed by the user, and since the search term is simply
4 characters long, additional constraints are necessary to prevent excessive false positive
identification. The userscript only looks for PDB IDs if it finds one of the following terms in
the web page: protein, PDB, or enzyme.

Sechemtic
Sechemtic is a small userscript that detects use of microformats (see Implementation) to mark up
molecular identifiers, as well as regular molecular names. It recognises markup for the IUPAC
InChI and SMILES, and creates links for those molecules to web resources like eMolecules [43],
PubChem [32] and a link to Google to search for more information (see Figure 5). It should be
noted that, in particular with Google searches, the links based on InChIs are more useful as the
same molecule may be represented by several different SMILES strings but only a single InChI.

From a technological point of view, these scripts are very simple in nature; the semantic nature
of the (chemical) microformats is what makes this simple script possible. The semantic markup in
HTML for InChIs that is picked up by the userscript looks like
<span class="inchi">InChI=1/CH4/h1H4</span> while the markup for a SMILES string looks like
<span class="smiles">CCO</span>.

Figure 5
Annotating chemical terms marked up with microformats. Screenshots showing a blog post (full URL:
[52]) containing chemical terms marked up with chemical microformats, (a) without and (b) with the
Sechemtic userscript enabled. The added hyperlinks allow the user to look up the structure in
Google, ChemSpider and PubChem.

Add quotes from Chemical blogspace to molecules
This userscript, quite similar to the one that adds comments to DOIs, runs on all web pages
accessed by the user. Using the same method as the Sechemtic userscript (see above), it identifies
any molecules referenced on a page which have been marked up with the appropriate tags. It also
supports the (non-marked up) InChI tags on PubChem. It then uses the Chemical blogspace API to
find out whether this molecule has been referenced in a blog post. The remainder is as for the
previous userscript; an icon is added which contains a popup to the citing blog post. Figure 6
shows the effect of the userscript on the PubChem page for methane (InChI=1/CH4/h1H4). A full list
of molecules with comments in Chemical blogspace is available from Ref. [44].

A possible use of this script is to link all discussions of a particular drug in the blogosphere
to a static page containing information on the drug. Another use is to link discussions on
syntheses of molecules to pages containing references to the molecule.

Discussion
Here we have focused on the development of userscripts that enhance web pages for biologists and
chemists. If all of these userscripts are installed, any web page with a PDB code will now contain
a link to view the structure in 3D, journal webpages will show chemical structure markup and blog
comments on articles, 3D structures and links to appropriate blog posts will be available from
PubChem, and any use of chemical microformats will be picked up, adding links to Google,
eMolecules, ChemSpider and PubChem.

These examples show that userscripts offer a powerful technology to improve the way we read the
scientific literature and access (bio)chemical databases. This is done by dynamically combining
web resources, and enriching the information content of the primary resources. Theoretically, such
links can be made on the web server itself, and this is commonly done, but it does not give the
user the flexibility to choose what features to install. The crucial point about userscripts is
that they do not require the involvement of the web site provider. All of the enhancements are
done on-the-fly by the user's browser.
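The PDB-Jmol heuristics described earlier (a four-character pattern, searched for only when the
page mentions protein, PDB or enzyme) can be sketched as follows. The pattern follows the
description in the text; treating the keyword test as case-insensitive is an assumption, and the
function name is hypothetical.

```javascript
// A digit followed by three letters or digits, as a whole word.
var PDB_ID = /\b[0-9][A-Za-z0-9]{3}\b/g;

// Keywords that must appear somewhere in the page before any search is made.
var KEYWORDS = /\b(protein|PDB|enzyme)\b/i;

// Return candidate PDB IDs, or an empty list if the page lacks the keywords.
function findPdbIds(text) {
  if (!KEYWORDS.test(text)) {
    return [];
  }
  return text.match(PDB_ID) || [];
}
```

Any four-character token that starts with a digit, such as a year, still matches once a keyword
is present, so the keyword guard reduces rather than eliminates false positives.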

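The microformat handling shared by Sechemtic and the Add quotes to molecules userscript can be
generalised to any class name, as in the hedged sketch below. The XPath expression and the
document.evaluate call mirror the code shown in the Implementation; the function names and the
Google link format are illustrative assumptions rather than code from the userscripts.

```javascript
// Collect the text of every <span> carrying the given microformat class.
// doc is a DOM Document; in a browser, snapshotType should be
// XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE.
function collectIdentifiers(doc, className, snapshotType) {
  var nodes = doc.evaluate(
    '//span[@class="' + className + '"]', doc, null, snapshotType, null
  );
  var values = [];
  for (var i = 0; i < nodes.snapshotLength; i++) {
    values.push(nodes.snapshotItem(i).textContent);
  }
  return values;
}

// Build a Google search link that treats the identifier as an exact phrase.
function googleLink(identifier) {
  return "https://www.google.com/search?q=" +
    encodeURIComponent('"' + identifier + '"');
}
```

Calling collectIdentifiers(document, "inchi", XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE) and then
again with "smiles" would cover both microformats with the same code path.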


Figure 6
Adding comments from the blogosphere to molecules. Screenshots from the PubChem web page for
methane (full URL: [53]), (a) without and (b) with the Add quotes from Chemical blogspace to
molecules userscript enabled. The InChI InChI=1/CH4/h1H4 is identified by the userscript, which
then adds the Cb logo. The logo is a link to the Chemical blogspace page for this molecule. The
popup titled Powered by Chemical blogspace (only partially shown) appears when the mouse is placed
on the Cb logo, and contains quotes from and links to blog posts that discuss this molecule.

The userscripts combine a number of technologies for data retrieval and communication.
Information from HTML pages is extracted using identifiers, regular expressions, XPath queries and
microformats. It is noted that the syntax of (bio)chemical and other identifiers is generally not
distinct enough to detect them with perfect recall and optimal precision. It is easiest to write
regular expressions for the DOI and the InChI with a high precision, compared to, for example, the
PDB ID which has a syntax which can clash with other web page content.

Microformats offer a solution for such less well-defined identifiers. This technology is used to
wrap identifiers with some semantic markup so that the userscript can easily extract the
identifiers using XPath queries. However, microformats do not incorporate a mechanism to provide
details on what a microformat means. That is, microformats are not backed up by a specified
ontology. As a result the chemical 'smiles' microformat, to markup SMILES, may collide with a
microformat specification to markup moods.

Once the identifier is extracted by whatever means, the userscripts can either create links to
other web resources, or query those resources and embed results into the HTML of the web page on
which the userscript is run. While any HTTP-based approach can be used for this, the example
userscripts show that combining XMLHttpRequest with JSON [24] is a rather straightforward
approach.

Conclusion
We have shown that userscripts are a simple and useful way of integrating bio- and
chemoinformatics web resources. In particular, they permit (a) the augmentation of existing
websites with functionality not envisioned or indeed wanted by the original author, (b) the
integration of information from different domains, and (c) a connection point between the social
web (wikis, blogs etc.) and traditional web tools and sites. We continue to find interesting uses
for userscripts, and we hope this manuscript will spur others to do likewise.

Availability and requirements
• Project name: Userscripts for Chemistry and Biology
• Project home page: Blue Obelisk [45] website [46]. Download link:
  http://blueobelisk.sf.net/wiki/Userscripts
• Operating system(s): Platform independent
• Programming language: JavaScript
• Other requirements: Firefox with Greasemonkey add-on (or equivalent) for userscript support;
  Java is required to view the Jmol applet; a Connotea account is required for the Add to
  Connotea userscript
• License: GNU GPL, BSD
• Any restrictions to use by non-academics: none



Authors' contributions
NMOB, ELW, HG and DJ have written userscripts mentioned in this text. RG developed and maintains
the 3D structure database and contributed to the development of the Pub3D userscript. DW and CS
devised and tested some of the userscripts. All authors have read and approved the final
manuscript.

Acknowledgements
Pedro Beltrão is acknowledged for laying the foundations of the userscript that adds blog comments
to journal web pages. We thank the anonymous reviewers for their constructive comments.

References
1. Galperin MY: The Molecular Biology Database Collection: 2007 update. Nucleic Acids Res 2007,
35:D3-D4.
2. Fox JA, McMillan S, Ouellette BFF: A compilation of molecular biology web servers: 2006 update
on the Bioinformatics Links Directory. Nucleic Acids Res 2006, 34:W3-W5.
3. Postgenomic [http://postgenomic.com/]
4. Chemical blogspace [http://cb.openmolecules.net/]
5. Good BM, Kawas EA, Kuo BY, Wilkinson MD: iHOPerator: User-scripting a personalized
bioinformatics Web, starting with the iHOP website. BMC Bioinformatics 2006, 7:534.
6. Etzold T, Ulyanov A, Argos P: SRS: information retrieval system for molecular biology data
banks. Method Enzymol 1996, 266:114-128.
7. Lee T, Pouliot Y, Wagner V, Gupta P, Calvert DS, Tenenbaum J, Karp P: BioWarehouse: a
bioinformatics database warehouse toolkit. BMC Bioinformatics 2006, 7:170.
8. ChemSpider [http://chemspider.com/]
9. Karthikeyan M, Krishnan S, Pandey AK, Bender A: Harvesting Chemical Information from the
Internet Using a Distributed Approach: ChemXtreme. J Chem Inf Model 2006, 46:452-461.
10. BioSpider [http://biospider.ca/]
11. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart D: BioSpider: A Web Server for Automating
Metabolome Annotations. Pacific Symp Biocomp 2007, 12:145-156.
12. Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Brief
Bioinform 2002, 3(4):331-341.
13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight
SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE,
Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet 2000, 25:25-29.
14. IUPAC International Chemical Identifier (InChI) [http://www.iupac.org/inchi/]
15. Greasemonkey [http://www.greasespot.net/]
16. Creammonkey [http://creammonkey.sourceforge.net/]
17. IE7pro [http://www.ie7pro.com/]
18. Turnabout [http://www.reifysoft.com/turnabout.php]
19. Coles SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the chemical semantic web
through the use of InChI identifiers. Org Biomol Chem 2005, 3(10):1832-1834.
20. Microformats [http://microformats.org/]
21. hCard Microformat [http://microformats.org/wiki/hcard]
22. hCalendar Microformat [http://microformats.org/wiki/hcalendar]
23. XML Path Language (XPath) 2.0 – W3C Recommendation
[http://www.w3.org/TR/2007/REC-xpath20-20070123/]
24. JSON [http://json.org/]
25. Townsend JA, Adams SE, Waudby CA, de Souza VK, Goodman JM, Murray-Rust P: Chemical documents:
machine understanding and automated information extraction. Org Biomol Chem 2004, 2:3294-3300.
26. Corbett P, Murray-Rust P: High-throughput identification of chemistry in life science texts.
In Computational Life Sciences II Volume 4216. Edited by: Berthold MR, Glen R, Fischer I.
Berlin/Heidelberg: Springer-Verlag; 2006:107-118.
27. RSC Project Prospect [http://www.rsc.org/Publishing/Journals/ProjectProspect/]
28. Wikipedia [http://www.wikipedia.org/]
29. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C,
Wikberg JES: Bioclipse: An open source workbench for chemo- and bioinformatics. BMC
Bioinformatics 2007, 8:59.
30. Connotea [http://www.connotea.org/]
31. pubmed2connotea 2006 [http://lindenb.integragen.org/pubmed2connotea/].
32. PubChem [http://pubchem.ncbi.nlm.nih.gov/]
33. Ertl P, Rohde B, Selzer P: Fast Calculation of Molecular Polar Surface Area as a Sum of
Fragment Based Contributions and Its Application to the Prediction of Drug Transport Properties.
J Med Chem 2000, 43:3714-3717.
34. Dong X, Gilbert KE, Guha R, Heiland R, Kim J, Pierce ME, Fox GC, Wild DJ: Web Service
Infrastructure for Chemoinformatics. J Chem Inf Model 2007, 47:1303-1307.
35. Agrafiotis DK: Stochastic Proximity Embedding. J Comp Chem 2003, 24:1215-1221.
36. Pub3D [http://rguha.ath.cx/~rguha/cicc/p3d/]
37. Jmol: an open-source Java viewer for chemical structures in 3D [http://www.jmol.org]
38. Willighagen E, Howard M: Fast and Scriptable Molecular Graphics in Web Browsers without
Java3D. Available from Nature Precedings 2007 [http://dx.doi.org/10.1038/npre.2007.50.1].
39. Herráez A: Biomolecules in the computer: Jmol to the rescue. Biochem Mol Biol Edu 2006,
34:255-261.
40. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The
Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.
41. FirstGlance in Jmol [http://firstglance.jmol.org/]
42. Mardia KV, Nyirongo VB, Green PJ, Gold ND, Westhead DR: Bayesian refinement of protein
functional site matching. BMC Bioinformatics 2007, 8:257.
43. eMolecules [http://emolecules.com/]
44. Chemical blogspace – Chemical Compounds [http://cb.openmolecules.net/inchis.php]
45. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL:
The Blue Obelisk-interoperability in chemical informatics. J Chem Inf Model 2006, 46:991-998.
46. Blue Obelisk Userscripts [http://blueobelisk.sourceforge.net/wiki/Userscripts]
47. Chemistry Central Journal web page for Majumder et al
[http://journal.chemistrycentral.com/content/1/1/10]
48. Majumder AB, Shah S, Gupta MN: Enantioselective transacetylation of (R,S)-β-citronellol by
propanol rinsed immobilized Rhizomucor miehei lipase. Chem Cent J 2007, 1:10.
49. BMC Bioinformatics web page for Spjuth et al [http://www.biomedcentral.com/1471-2105/8/59]
50. PubChem page for aspirin
[http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244]
51. BMC Bioinformatics web page for Mardia et al [http://www.biomedcentral.com/1471-2105/8/257]
52. Counting consitutional isomers from the molecular formula
[http://chem-bla-ics.blogspot.com/2006/12/counting-stereoisomers-from-molecular_17.html]
53. PubChem page for methane
[http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=297]
54. Google [http://google.com/]



SOFTWARE Open Access

Confab - Systematic generation of diverse low-energy conformers
Noel M O’Boyle1,2*, Tim Vandermeersch2, Christopher J Flynn1, Anita R Maguire1 and Geoffrey R Hutchison2,3

Abstract
Background: Many computational chemistry analyses require the generation of conformers, either on-the-fly, or in
advance. We present Confab, an open source command-line application for the systematic generation of low-
energy conformers according to a diversity criterion.
Results: Confab generates conformations using the ‘torsion driving approach’ which involves iterating
systematically through a set of allowed torsion angles for each rotatable bond. Energy is assessed using the
MMFF94 forcefield. Diversity is measured using the heavy-atom root-mean-square deviation (RMSD) relative to
conformers already stored. We investigated the recovery of crystal structures for a dataset of 1000 ligands from the
Protein Data Bank with fewer than 1 million conformations. Confab can recover 97% of the molecules to within 1.5
Å at a diversity level of 1.5 Å and an energy cutoff of 50 kcal/mol.
Conclusions: Confab is available from http://confab.googlecode.com.

Introduction
The generation of molecular conformations is an essential part of many computational analyses in
chemistry, particularly in the field of computational drug design. Methods such as 3D QSAR,
protein-ligand docking and pharmacophore generation and searching [1] all require the generation of
conformers, whether on-the-fly (as part of the method) or pre-generated by a stand-alone conformer
generator. In contrast to 3D structure generators (such as CORINA [2], DG-AMMOS [3] and smi23d [4]),
which focus on the generation of a single low-energy conformation, conformation generators create an
ensemble of conformers that cover the entire space of low-energy conformations or that part of
conformational space occupied by biologically-relevant conformers.
Several proprietary conformation generators are currently available (including OMEGA [5], ROTATE [6],
Catalyst [7], Confort [8], ConfGen [9], Balloon [10] and MED-3DMC [11] among others) but only recently
have open source conformation generators appeared: Frog2 [12] generates conformers using a Monte Carlo
approach, while Multiconf-DOCK [13] adapts the systematic search code from DOCK5 [14] to generate
diverse conformers via a torsion-driving approach.
Confab 1.0 is the first release of Confab, an open source conformation generator whose goal is the
systematic coverage of conformational space. Accuracy has been favoured over the introduction of
approximations to improve performance. The algorithm starts with an input 3D structure which, after
some initialisation steps, is used to generate multiple conformers which are filtered on-the-fly to
identify diverse low energy conformers. Conformations are generated using the torsion-driving approach
from a set of predefined allowed torsion angles. Ring conformations are not currently sampled.
The first section of the paper describes the algorithm used by the software and some implementation
details. After this, two applications of the software are described: an analysis of the conformational
space of a dataset of 1000 molecules (which includes a comparison to Multiconf-DOCK), and an
investigation of the conformational preferences of a particular phenyl sulfone.

Methods
Algorithm
The Confab algorithm is outlined in Figure 1. The input required is a 3D structure with reasonable bond
lengths and angles. Since the algorithm does not currently

* Correspondence: baoilleach@gmail.com
1Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork,
Co. Cork, Ireland


J. Cheminf. 2011, 3, 8.


explore ring conformations, any rings present should be in reasonable conformations.
The first step of the algorithm is the identification of rotatable bonds. These are defined as all
acyclic single bonds where both atoms of the bond are connected to at least two non-hydrogen atoms, but
neither atom of the bond is sp-hybridised. Note that this definition excludes rotation around bonds
that interchange hydrogens (for example, the rotation of the hydrogens of a methyl group), but this
does not imply any loss of accuracy as it is usual practice to exclude hydrogens when calculating the
RMSD (see below).
The method used by Confab to generate conformations is known as the torsion-driving approach. A set of
allowed torsion angles for each rotatable bond is assigned to each bond by searching for a match to
predefined SMARTS strings in a user-configurable file (torlib.txt) included in the Confab distribution.
This file is part of the Open Babel project and it assigns values to particular rotatable bonds using
data from Huang et al. [15].
Once the allowed torsion angles are assigned, they are corrected for topological (that is, graph)
symmetry. The presence of such symmetry allows performance to be improved by eliminating redundant
evaluations, thus reducing the number of conformations that will be tested. 2-fold symmetry is
identified when a rotatable bond involves an sp2 hybridised carbon atom where the neighbouring two
atoms affected by the rotation are both of the same symmetry class. When this occurs the allowed values
of that torsion are halved by restricting them to those less than 180°. The same is done for the case
of 3-fold symmetry at an sp3 hybridised carbon where the three neighbours are of the same symmetry
class; in this case the torsion angles are restricted to those less than 120°. If graph symmetry is
identified at both ends of a rotatable bond, the result is multiplicative; a 2-fold and a 3-fold
symmetry combine to restrict allowed values of the torsion angles to 360/6 = 60°.
The next step is to obtain an estimate of the energy of the most stable conformer. Throughout Confab,
energies are calculated using the MMFF94 forcefield [16]. The values of the bond stretching, angle
bend, stretch bend and out-of-plane bending terms are constant for all conformers of the same molecule;
only the torsion, Van der Waals and electrostatic terms were repeatedly evaluated. A low energy
conformer is found using a simple greedy algorithm. Each torsion angle is optimised starting with the
most central torsion and proceeding outwards. As this procedure is relatively fast (compared to the
combinatorial problem of searching for the global optimum) it is repeated up to 16 times by testing the
four most central torsions in different orders. The lowest energy conformer found is used as a
reference point for applying an energy cutoff during the conformer search. If, during the actual
conformer generation a lower energy conformer is found, this lower energy is used instead for the
reference from that point on.
The main part of the algorithm is the systematic generation and assessment of all conformers described
by the allowed torsion angles. Confab generates each of these in turn up to a user-specified cutoff
(the default is 10^6) and determines its energy relative to the lowest energy conformer found so far.
If this is within a user-specified energy cutoff (50 kcal/mol by default), it is assessed for diversity
to the conformers already stored (see below). If it is found to be diverse, it is itself stored
otherwise it is discarded. The algorithm then moves onto the next conformer.
Rather than iterate in a ‘depth-first’ manner over the torsions and their allowed angles, Confab uses a
Linear Feedback Shift Register (LFSR) to iterate in a random order over all of the conformers. A LFSR
allows the generation of all integers from 1 to N pseudorandomly without repetition and without any
memory overhead (which is important for large values of N). By iterating randomly, Confab avoids
biasing generated conformers towards a particular region of conformational space, for example towards
the input conformation. It also helps

Figure 1 Flowchart depicting the Confab algorithm.
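
The LFSR iteration described in this part of the Methods can be illustrated with a minimal Python sketch. It is not Confab's C++ code: the function name lfsr_order, the Galois-style update and the small tap table (widths 2 to 12 only, using commonly tabulated maximal-length tap positions) are assumptions made for illustration. States larger than N are simply skipped, so every integer from 1 to N appears exactly once and only the current register value is held in memory.

```python
# Tap positions for maximal-length LFSRs of 2 to 12 bits.
# Assumption: commonly tabulated values; the taps Confab itself uses
# are not reproduced in the paper text.
TAPS = {
    2: (2, 1), 3: (3, 2), 4: (4, 3), 5: (5, 3), 6: (6, 5), 7: (7, 6),
    8: (8, 6, 5, 4), 9: (9, 5), 10: (10, 7), 11: (11, 9), 12: (12, 6, 4, 1),
}


def lfsr_order(n, seed=1):
    """Yield every integer from 1 to n exactly once, in pseudorandom order."""
    if n < 1:
        return
    # Smallest register width whose 2**width - 1 non-zero states cover 1..n
    width = max(2, n.bit_length())
    if width not in TAPS:
        raise ValueError("n is too large for this illustrative tap table")
    mask = sum(1 << (tap - 1) for tap in TAPS[width])
    state = seed
    # A maximal-length register visits each non-zero state once per cycle
    for _ in range((1 << width) - 1):
        if state <= n:          # skip states outside the wanted range
            yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask       # Galois-style feedback


if __name__ == "__main__":
    print(list(lfsr_order(12)))
```

In a conformer search each yielded integer would be decoded into one combination of torsion angles; that decoding step is omitted here.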



increase diversity if the number of possible conformations is greater than the cutoff for the number
tested.
Diversity is ensured by calculating the heavy-atom RMSD (after least-squares alignment) of the newly
generated conformation to those previously stored. The alignment is carried out using the QCP algorithm
of Theobald [17] (which we found to be about twice as fast as the popular Kabsch alignment method
[18]). Despite this, when a molecule has many conformers and a large number of conformers have been
stored, full pairwise RMSD calculations take an excessive amount of time. To minimise the number of
RMSD evaluations required to discard a conformer, chosen conformers are stored in a tree structure that
effectively clusters conformers on-the-fly by RMSD. Figure 2(a) shows a typical ‘diversity tree’ where
each level of the tree is associated with a smaller RMSD diversity from 3.0 Å down to the cutoff
specified by the user (1.6 Å in the figure). Each node of the tree represents a stored conformation.
Sibling nodes (that is, nodes at the same level that share the same parent node) differ by at least the
RMSD diversity associated with that level. Note that sibling nodes are ordered and that the first child
node of each parent is the same as the parent itself.
To illustrate the algorithm, let us imagine adding a new conformation H to the tree depicted in Figure
2(a). The algorithm starts at the top of the tree and determines which of the two branches (A or B) to
take at the 3.0 Å diversity level. To do so it checks whether H is within 3.0 Å RMSD of A. If so, it
follows the tree down to the next level, and checks to see whether it is within 2.0 Å RMSD of A (note
that it does not need to recalculate the RMSD to do this). If this is not true, then it checks for
2.0 Å similarity to C. If so, it follows C down to the next level; otherwise it checks against D. If it
is not similar to D, H is stored in the tree as the next sibling at that level of the tree (this is
depicted in Figure 2(b)). When adding a new node for a conformation at a particular level, if the level
is not at the bottom then child nodes containing that conformation are added at successively lower
levels until the bottom level is reached. Overall, there are two possibilities; either the algorithm
reaches the bottom level and finds that the new conformation is within the RMSD cutoff of an existing
conformer, in which case it is discarded, or else it is of sufficient diversity to be stored at some
level of the tree.
This algorithm greatly reduces the number of RMSD evaluations during the conformer generation loop.
However it does not eliminate all conformations that are similar to those already stored; conformations
may be retained that differ by less than the RMSD cutoff if they end up in different branches. To prune
the set of retained conformations, while still avoiding a computationally expensive pairwise RMSD
calculation, all of the retained conformations are added one-by-one to a new tree in order of
increasing energy. This time the algorithm used for adding conformations to the diversity tree is more
robust: all sibling conformations are tested for similarity, even after finding one that is similar.
The result is that the same conformation may be added at several different points in the tree. This
makes the tree more effective at eliminating similar conformations at the expense of a greater number
of RMSD calculations.
Calculation of an RMSD can be overestimated when a molecule’s structure has automorphisms (a
permutation of the atoms of a molecule that preserves the bond connections). For example, if you
consider a para-substituted phenyl ring where two conformations differ by a rotation of 180° around the
substituted carbons, it is clear that the calculated RMSD between the conformations should be 0.
However, if the symmetry of the phenyl ring is not taken into account this will not be the case and the
RMSD will be overestimated as the corresponding atoms of the two structures have moved. The
symmetry-corrected RMSD is obtained by iterating over the automorphisms of the molecule and taking the
minimum value of the resulting RMSDs. For performance reasons,

Figure 2 An example diversity tree used to filter conformations on-the-fly. (a) A diversity tree containing five conformations (A to E) used
to filter conformations with an RMSD of less than 1.6 Å to one of the stored conformations. (b) The same diversity tree after addition of
conformer H, where H is within 3.0 Å of A but not within 2.0 Å of A, C or D.
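
The symmetry-corrected RMSD described in the Methods (the minimum RMSD over all automorphisms) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Confab implementation: the coordinates are taken to be already superimposed (Confab performs a least-squares alignment with QCP, which is omitted here), and the list of automorphisms is supplied by hand instead of being detected by Open Babel. The function names and the four-atom toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain RMSD between two equally ordered lists of (x, y, z) tuples."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def symmetry_corrected_rmsd(coords_a, coords_b, automorphisms):
    """Minimum RMSD over atom permutations that preserve the bond connections."""
    return min(
        rmsd(coords_a, [coords_b[i] for i in perm])
        for perm in automorphisms
    )


# Toy example: atoms 0 and 1 lie on the rotation axis, atoms 2 and 3 are
# equivalent by symmetry and swap places after a 180 degree rotation.
conf_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, -1.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
perms = [(0, 1, 2, 3), (0, 1, 3, 2)]  # identity and the swap of atoms 2 and 3

if __name__ == "__main__":
    print(rmsd(conf_a, conf_b))
    print(symmetry_corrected_rmsd(conf_a, conf_b, perms))
```

Without the correction the two conformations appear to differ, although they are the same structure; with it the RMSD is zero, as the para-substituted phenyl example in the text requires.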



the calculation of the RMSD is not symmetry-corrected during the main conformation generation loop.
However it is used afterwards when building the final diversity tree, thereby eliminating any
conformations that were retained in error.

Implementation
Confab is essentially a modified version of Open Babel [19], a widely-used cheminformatics toolkit
written in C++ and available under the open source GPL v2 licence [20]. In fact, some of the code
written for Confab has been merged into the main Open Babel distribution (such as the original Kabsch
alignment code) but due to an additional dependency (on tree.hh, see below) the core code has not been
included in Open Babel v2.3. The MMFF94 forcefield, the conformer generation framework and the
automorphism detection are all provided by Open Babel. QCP alignment was implemented using Theobald’s
public domain code [21] in combination with the Eigen2 high performance linear algebra library [22].
The diversity analysis code relies on a tree data structure provided by the Open Source tree.hh library
[23]. The code used to implement the Linear Feedback Shift Register (LFSR) was adapted from its
corresponding Wikipedia article [24]. Tap values for the register were taken from Alfke’s Xilinx
application note [25].
The Confab distribution contains two command-line applications: confab and calcrmsd. The former
implements the Confab algorithm to generate conformers given an input 3D structure, while the latter
may be used to assess the performance of confab by comparing the generated conformers to a file
containing crystal structures. Full details of these applications are available on the Confab website.

Coverage of Conformational Space
Dataset
To illustrate the performance of Confab, we used a dataset of 1000 small molecule crystal structures
derived from that of Borodina et al. [26]. The original source is the PDB; thus this dataset represents
bioactive conformations of molecules. The 3D structures of the 14504 ligands in the Borodina dataset
were obtained using the PubChem Download Service (using the PubChem Substance IDs from Borodina et
al.). Of these, 16 could not be handled by the MMFF94 forcefield, 5202 had no rotatable bonds (this
fraction included a large number of trivial salts) and 2348 had more than 1 million conformers
(according to Confab’s torsion rules). 1000 structures were randomly chosen from the 6938 remaining.
See Additional file 1 for the structures of these 1000 molecules.
To avoid bias towards the crystal structures, the input conformations for Confab were generated by
building the 3D structure using Open Babel. After the initial structure generation, the structures were
optimised using the MMFF94 forcefield (200 steps steepest descent). Since Confab does not explore ring
conformations, ring conformations were taken from the crystal structure for the initial structure
generation. See Additional file 2 for the generated structures.

Results
Figure 3(b) shows an overview of the dataset of 1000 structures in terms of the number of rotatable
bonds in each molecule. Although the dataset contains molecules with up to 12 rotatable bonds, it is
clear by comparison with the full dataset of Borodina et al. in Figure 3(a) that the reduced dataset is
only a representative sample for molecules having up to 7 rotatable bonds. Beyond this, the restriction
that the molecule must have fewer than 1 million conformers leads to the elimination of most of the
molecules. For this reason, to avoid

Figure 3 The distribution of molecules in terms of the number of rotatable bonds in (a) the dataset of
Borodina et al., and (b) our dataset of 1000 molecules.
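
The conformer count used to filter the dataset follows from the torsion-driving approach: it is the product of the number of allowed torsion values over all rotatable bonds, after the symmetry reduction described in the Methods. The sketch below illustrates that arithmetic only. It assumes every bond carries the 12 values in 30° steps that the paper quotes for a carbon-carbon single bond, whereas the real values are assigned per bond from torlib.txt; the function names are hypothetical.

```python
from math import prod


def restrict_for_symmetry(angles, fold):
    """Keep only the torsion angles below 360/fold degrees."""
    limit = 360 / fold
    return [a for a in angles if a < limit]


def total_conformers(angles_per_bond):
    """Number of conformers a systematic torsion search has to visit."""
    return prod(len(angles) for angles in angles_per_bond)


# Assumed torsion rule: 0 to 360 degrees in 30 degree increments (12 values)
cc_single = list(range(0, 360, 30))

if __name__ == "__main__":
    print(len(restrict_for_symmetry(cc_single, 2)))   # 2-fold symmetry
    print(len(restrict_for_symmetry(cc_single, 3)))   # 3-fold symmetry
    print(len(restrict_for_symmetry(cc_single, 6)))   # 2-fold and 3-fold combined
    print(total_conformers([cc_single] * 5))
    print(total_conformers([cc_single] * 6))
```

Under this assumed rule, five unsymmetrical bonds stay below the 1 million limit while six exceed it, which shows how quickly the limit is reached.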



erroneous conclusions some of the following analyses
(where stated) will not consider molecules having 8 or
more rotatable bonds.
Confab was used to exhaustively generate all low
energy conformers for each molecule in the dataset for
diversity values ranging from 0.4 Å to 3.0 Å RMSD. The
default setting of 50 kcal/mol was used as an energy
cutoff. The default value of 1 million conformers was
used as the conformer cutoff; this ensured exhaustive
coverage of conformational space (as defined by Con-
fab’s torsion rules) as structures with more conformers
were not included in the dataset (see above). Figure 4
shows the mean time for conformer generation per
molecule. This is largely independent of the diversity
level for diversity levels greater than or equal to 1.0 Å.
For values less than this, an increasing amount of time
is spent performing the pairwise RMSD calculations
against stored conformations.
Performance of conformer generators is typically mea-
sured by the percent recovery of crystal structures with
respect to a particular RMSD cutoff (see for example
Ref [9]). This is simply the percentage of molecules
which have a generated conformer within a particular
RMSD of the crystal structure. Commonly used values
for this RMSD cutoff are 2.0, 1.5 and 1.0 Å.
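
The percent recovery measure defined above can be written as a short Python function. This is a sketch only: the function name is hypothetical, the RMSD values in the example are invented for illustration and are not taken from the dataset, and treating "within" as less than or equal to the cutoff is an assumption.

```python
def percent_recovery(min_rmsds, cutoff):
    """Percentage of molecules whose closest generated conformer lies
    within `cutoff` (in Angstrom) of the crystal structure."""
    hits = sum(1 for value in min_rmsds if value <= cutoff)
    return 100.0 * hits / len(min_rmsds)


# Invented minimum RMSDs to the crystal structure, one per molecule
example = [0.4, 0.9, 1.2, 1.6, 2.3]

if __name__ == "__main__":
    for cutoff in (2.0, 1.5, 1.0):
        print(cutoff, percent_recovery(example, cutoff))
```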
Figure 5(a) shows the percent recovery at these cutoffs
for different values of the RMSD diversity. At 2.0 Å
RMSD diversity, 99% are within 2.0 Å RMSD of the
crystal (83% within 1.5, 41% within 1.0); at 1.5 Å RMSD
diversity, 99% are within 2.0 Å (97% within 1.5, 50%
within 1.0); at 1.0 Å RMSD diversity, 99% are within 2.0
Å RMSD (98% within 1.5, 89% within 1.0). As expected,

Figure 5 Performance measured as % recovery of crystal
structures. (a) Performance for different RMSD cutoffs. The diversity
cutoff is where the value of the RMSD diversity is used as the RMSD
cutoff. (b) The RMSD cutoff required to achieve a particular level of
% recovery. The diagonal line indicates the maximum RMSD cutoff
expected when there is complete coverage.

the percentage of crystal structures that are found
decreases as the RMSD diversity increases. In particular,
the curves fall off steeply once the RMSD diversity is
greater than the required cutoff.
Figure 4 Effect of diversity level on speed of conformer
generation. Times were measured on an Intel Xeon E5620
Processor (2.4GHz, 4C) with 32GB RAM.

An interesting question to ask is what RMSD diversity
is required to recover X% of crystal structures with
respect to a particular RMSD cutoff? Figure 5(b) shows
the answer to this where X is 90%, 95% or 98%. For
example to find 95% of the crystal structures within a
2.0 Å cutoff an RMSD diversity of 2.4 Å (or smaller) is
required, but to find the same percentage to within
1.5 Å an RMSD diversity of 1.6 Å is needed. However,
even an RMSD diversity of 0.4 Å will not recover 98%



of the structures to within 1.0 Å (it only recovers 96%), an indication of the inherent diversity of
the generated conformers as discussed further below.
As pointed out by Borodina et al. [26], if the conformational space is perfectly covered and lacks any
‘holes’ then the RMSD diversity is an upper bound of the minimum RMSD to the crystal structure. In
other words, at an RMSD diversity of 1.5 Å for example, all crystal structures should be found to
within 1.5 Å. The diagonal line in Figure 5(b) indicates the maximum RMSD cutoff expected if this ideal
behaviour is observed. It is clear from the figure that at low RMSD diversity the actual performance is
poorer than this.
There are two main problems that give rise to gaps in conformational coverage. The first is that the
allowed torsion values may not encompass the specific torsion angle observed in the crystal structure.
For this dataset, there are 7 molecules for which the crystal structure could not be found within 2.0 Å
even at 0.4 Å RMSD diversity. These molecules (PubChem substance IDs of 584680, 823881, 825747, 826196,
828032, 830919 and 834618), of which two represent different conformations of the same molecule, all
involve sugar moieties and it may be that the allowed torsion angles of the glycosidic bond are too
conservative.
The second is that the granularity of the allowed torsion settings may not be sufficiently fine to
allow solutions to be found to within a low RMSD cutoff. For example, a carbon-carbon single bond has
12 allowed torsion values from 0 to 360° in increments of 30°. If such a bond is centrally located in a
large molecule, even if the crystal structure has similar torsion angles to one of these conformers the
RMSD may differ significantly.
Based on this dataset, the inherent granularity of the Confab generated conformers is around 1.4 Å, as
indicated by the “Diversity Cutoff” line in Figure 5(a) which falls off sharply as the RMSD diversity
decreases below 1.4 Å. This line indicates the percent recovery at different levels of RMSD diversity
when the RMSD cutoff used is the same as the diversity level. The sharp fall off below 1.4 Å is a
deviation from the ideal behaviour described by Borodina et al.
Table 1 shows the median number of generated conformers tested for molecules with different numbers of
rotatable bonds. Broadly speaking, about one third of the conformers pass the energy cutoff applied.
Although the size of each individual subset is not very large, and the values for 6 rotatable bonds
seem to be biased towards a larger number of conformers, some general points can still be made.
The number of diverse conformers is much reduced by a higher diversity level. For example, for those
molecules with 7 rotatable bonds there are approximately 11000 low energy conformers of which about 13%
are diverse at 0.5 Å RMSD, only 1.3% are diverse at 1.0 Å RMSD, and only 0.16% are diverse at 1.5 Å
RMSD.
The values in Table 1 are in broad agreement with those reported by Smellie et al. [27] for a
representative subset of their dataset (see table three therein). They make the point that the number
of conformers required to cover conformational space is really surprisingly low. For a molecule with 7
rotatable bonds in our dataset, conformational space can be covered to within 1.0 Å with merely
hundreds of conformations while just tens of conformations will achieve a coverage of 1.5 Å. Of course,
these figures are expected to increase with each additional rotatable bond.
For completeness, Table 1 also reports median values for the minimum RMSD to the crystal structure.
However, as a metric for coverage these values give a misleadingly positive picture compared to the
percent recovery values discussed above.

Comparison with Multiconf-DOCK
Multiconf-DOCK [13] is another open source conformer generator that uses a torsion driving approach to
implement a systematic search to identify diverse low energy

Table 1 Relationship between the number of rotatable bonds, the number of conformers generated and the minimum
RMSD to the crystal structure

Rotatable  Number of  Total       Low Energy  Diverse Conformers (median)        Minimum RMSD to crystal (median)
bonds†     molecules  Conformers  Conformers  0.5 Å  1.0 Å  1.5 Å  2.0 Å  3.0 Å  0.5 Å  1.0 Å  1.5 Å  2.0 Å  3.0 Å
                      (median)    (median)
1          214        3           3           3      1      1      1      1      0.18   0.40   0.45   0.45   0.45
2          97         36          25          8      2      1      1      1      0.34   0.54   0.74   0.80   0.80
3          216        72          44          19     4      1      1      1      0.39   0.70   1.02   1.06   1.06
4          143        1296        582         96     9      2      1      1      0.52   0.80   1.07   1.14   1.24
5          86         3024        1065        189    24     4      1      1      0.60   0.82   1.14   1.31   1.34
6          114        186624      24317       2953   192    24     5      1      0.71   0.90   1.21   1.49   1.78
7          69         34992       10679       1402   139    17     4      1      0.66   0.83   1.14   1.44   1.73

† The 61 molecules with 8 or more rotatable bonds are omitted.
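
As a quick check, the percentages quoted in the text for molecules with 7 rotatable bonds can be recomputed from the medians in Table 1. The snippet below is only a worked calculation; note that it takes ratios of medians, which is an approximation and not necessarily how the figures in the text were derived.

```python
# Medians for molecules with 7 rotatable bonds, copied from Table 1
total_conformers = 34992
low_energy = 10679
diverse = {0.5: 1402, 1.0: 139, 1.5: 17}   # keyed by RMSD diversity in Angstrom

# Share of all conformers that pass the energy cutoff
passing = 100.0 * low_energy / total_conformers

# Share of low energy conformers that are diverse at each level
fractions = {level: 100.0 * n / low_energy for level, n in diverse.items()}

if __name__ == "__main__":
    print(f"pass energy cutoff: {passing:.1f}%")
    for level, pct in fractions.items():
        print(f"diverse at {level} A: {pct:.2f}%")
```

The results agree with the "about one third", 13%, 1.3% and 0.16% stated in the text.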



conformers. This software differs in that it uses the AMBER force field [28,29] (as implemented in
DOCK5) instead of MMFF94. In addition, it implements performance improvements such as search tree
pruning by partial energy estimation [14]. Like Confab, the software requires a 3D structure as input.
Multiconf-DOCK was used to generate conformations for the 1000 structures in the dataset using the same
input as for Confab but converted to MOL2 using Open Babel v2.3.0. It should be noted that the
specified Sybyl atom types in the input MOL2 file have an effect on the conformations generated by
Multiconf-DOCK. The parameters used were taken from the example provided with the Multiconf-DOCK
distribution, except that no restriction was placed on the number of generated conformations and the
energy cutoff was set to 50 kcal/mol (as used for Confab). Three different RMSD diversity levels were
investigated: 2.0 Å, 1.5 Å and 1.0 Å. For all three diversity levels, the mean time spent per molecule
was 6.3 s (measured on the same machine used for Figure 4).
The performance in terms of percent recovery is as follows: at 2.0 Å RMSD diversity, 99% are within
2.0 Å RMSD of the crystal structure (89% within 1.5, 55% within 1.0); at 1.5 Å RMSD diversity, 99% are
within 2.0 Å (97% within 1.5, 64% within 1.0); at 1.0 Å RMSD diversity, 99% are within 2.0 Å (98%
within 1.5, 80% within 1.0). These values are broadly similar to those for Confab (see above). The most
noticeable differences occur for the percentage of structures found to within 1.0 Å RMSD; assuming that
both programs successfully remove conformations that are within the diversity cutoff, Multiconf-DOCK
outperforms Confab at the 2.0 Å and 1.5 Å RMSD diversity levels but Confab performs better at 1.0 Å
RMSD diversity.
Table 2 shows the median number of conformers generated by Multiconf-DOCK, along with the minimum RMSD
to the crystal structure, broken down by the number of rotatable bonds. Compared to Confab the number
of conformers generated is far fewer. It is difficult to say whether this represents a less
comprehensive coverage of conformational space or whether this is due to the use of different
forcefields. In terms of the minimum RMSD to the crystal structure, once again we see that
Multiconf-DOCK performs better than Confab at the 2.0 Å and 1.5 Å RMSD diversity levels but Confab is
better at 1.0 Å RMSD diversity.

Distance Distribution in Conformations of a Phenyl Sulfone
Many conformer generators are focused on reproducing bioactive conformations. However it is worth
remembering that the generation of conformers may also be useful in other contexts. Here we use Confab
as an aid to interpret the NMR spectra for the phenyl sulfone shown in Figure 6. The peak for the
methylene carbon of the ethyl ester was split unexpectedly (compared to an analogous sulfone where the
phenyl group was replaced by tert-butyl), and our hypothesis was that this was due to the close
approach of the methylene carbon to one of the sulfonyl oxygens in solution. Confab was used to
investigate whether low energy conformations existed where the methylene group was in close proximity
to a sulfonyl oxygen.
Confab was used to generate a set of conformations of the molecule with a diversity of 0.2 Å and no
energy cutoff. The resulting 2014 conformations were optimised using a MMFF94 forcefield (200 steps
steepest descent; implemented using Pybel [30]) and the final energy recorded. For each of the
conformations the minimum distance between a sulfonyl oxygen and the methylene carbon was measured.
Figure 7 shows a plot of these distances versus the relative energies of the conformers with marginal
histograms showing the distribution of values. The methylene carbon does not approach the sulfonyl
group very closely. For low energy conformers, the distances are clustered around 4.0 Å and 5.4 Å with
the former more frequent. Taking 5 kcal/mol as a cutoff, the distance can be as low as 3.7 Å but
shorter distances (down to 3.0 Å)

Table 2 Results for Multiconf-DOCK showing the relationship between the number of rotatable bonds, the number of
conformers generated and the minimum RMSD to the crystal structure

Rotatable  Diverse Conformers (median)  Minimum RMSD to crystal (median)
bonds      1.0 Å  1.5 Å  2.0 Å          1.0 Å  1.5 Å  2.0 Å
1          1      1      1              0.34   0.40   0.40
2          3      1      1              0.50   0.67   0.71
3          2      1      1              0.68   0.78   0.81
4          9      3      1              0.76   0.97   1.05
5          14     4      2              0.85   1.03   1.28
6          43     15     5              1.08   1.23   1.37
7          21     8      3              1.04   1.24   1.40
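
The distance measurement used in the phenyl sulfone study (the minimum distance between a sulfonyl oxygen and the methylene carbon, evaluated for each conformer) reduces to a small geometric calculation. The sketch below is not the script used in the paper: it works on plain coordinate tuples rather than Pybel molecules, and the function name, atom indices and coordinates are all hypothetical.

```python
import math


def min_distance(coords, oxygen_indices, carbon_index):
    """Shortest distance (same units as the coordinates) between the chosen
    carbon and any of the listed oxygens in one conformer."""
    carbon = coords[carbon_index]
    return min(math.dist(coords[i], carbon) for i in oxygen_indices)


# Invented coordinates for one conformer: two oxygens and one carbon
conformer = [
    (0.0, 0.0, 0.0),   # index 0: first sulfonyl oxygen (hypothetical)
    (0.0, 0.0, 2.5),   # index 1: second sulfonyl oxygen (hypothetical)
    (3.0, 4.0, 0.0),   # index 2: methylene carbon (hypothetical)
]

if __name__ == "__main__":
    print(min_distance(conformer, oxygen_indices=[0, 1], carbon_index=2))
```

Applying such a function to every optimised conformer and pairing each distance with the conformer's relative energy gives the kind of data plotted in Figure 7.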



are only possible with an associated energy penalty. Figure 6 shows one of the low energy conformations
(relative energy of 4.6 kcal/mol) which has a distance of 3.7 Å between the groups of interest.

Figure 6 Structure of the phenyl sulfone studied.

Figure 7 Scatterplot with marginal histograms of distance versus energy for the set of conformations of
the phenyl sulfone in Figure 6.

Conclusion
The goal of this first release of Confab is to ensure complete coverage of all of the low energy
conformers of a molecule. While every effort is made to maximise performance, accuracy has been the
main goal. Approximations that reduce the search space on the basis of heuristics have been avoided for
this reason.
Using the results from Confab 1.0 as a comparison, future work will investigate strategies to overcome
the combinatorial explosion associated with large numbers of rotatable bonds [31] including the
trade-off between speed and accuracy.

Availability and Requirements
Project name: Confab
Project home page: http://confab.googlecode.com
Operating system(s): Cross-platform
Programming language: C++
Other requirements (if compiling): CMake 2.4+, Eigen2
Licence: GPL v2

Additional material
Additional file 1: Crystal structures used to test conformational coverage. This is a text file in SDF
format containing biological conformations (as downloaded from PubChem) of 1000 molecules. This is a
subset of the data used in the study by Borodina et al.
Additional file 2: Generated 3D structures used to test conformational coverage. This is a text file in
SDF format containing 3D structures of the 1000 molecules in the dataset generated using Open Babel.
These were used as the input to Confab.

Acknowledgements and Funding
NMOB is supported by a Health Research Board Career Development
Fellowship, PD/2009/13. We thank several beta testers for their valuable
feedback, and the anonymous reviewers for their constructive comments.

Author details
1Analytical and Biological Chemistry Research Facility, University College
Cork, Western Road, Cork, Co. Cork, Ireland. 2Open Babel development team.
3Department of Chemistry, University of Pittsburgh, Chevron Science Center,
219 Parkman Avenue, Pittsburgh, PA 15260, USA.

Authors’ contributions
NMOB devised and implemented Confab, and carried out the coverage
analysis. GRH implemented the conformer generation framework in Open
Babel and contributed to the forcefield code. TV implemented the
automorphism code in Open Babel and contributed to the forcefield code.
NMOB collaborated with CJF and ARM on the sulfone investigation. All
authors read and approved the final manuscript.

Received: 9 February 2011 Accepted: 16 March 2011
Published: 16 March 2011

References
1. Schwab CH: Conformations and 3D pharmacophore searching. Drug
Discov Today Tech 2010, 7:e245-e253.
2. Sadowski J, Gasteiger J, Klebe G: Comparison of Automatic Three-
Dimensional Model Builders Using 639 X-Ray Structures. J Chem Inf
Comput Sci 1994, 34(4):1000-1008.
3. Lagorce D, Pencheva T, Villoutreix BO, Miteva MA: DG-AMMOS: A New tool
to generate 3D conformation of small molecules using Distance
Geometry and Automated Molecular Mechanics Optimization for in
silico Screening. BMC Chem Biol 2009, 9:6.



4. Gilbert K, Guha R: smi23d. [http://www.chembiogrid.org/cheminfo/smi23d/].
5. Hawkins PCD, Skillman AG, Warren GL, Ellingson BA, Stahl MT: Conformer
Generation with OMEGA: Algorithm and Validation using High Quality
Structures from the Protein Databank and Cambridge Structural
Database. J Chem Inf Model 2010, 50:572-584.
6. Renner S, Schwab CH, Gasteiger J, Schneider G: Impact of Conformational
Flexibility on Three-Dimensional Similarity Searching Using Correlation
Vectors. J Chem Inf Model 2006, 46:2324-2332.
7. Catalyst. Accelrys Inc: San Diego, CA; [http://accelrys.com/].
8. Confort. Tripos Inc: St Louis, MO; [http://www.tripos.com/].
9. Watts KS, Dalal P, Murphy RB, Sherman W, Friesner RA, Shelley JC: ConfGen:
A Conformational Search Method for Efficient Generation of Bioactive
Conformers. J Chem Inf Model 2010, 50:534-546.
10. Vainio MJ, Johnson MS: Generating Conformer Ensembles Using a
Multiobjective Genetic Algorithm. J Chem Inf Model 2007, 47:2462-2474.
11. Sperandio O, Souaille M, Delfaud F, Miteva MA, Villoutreix BO: MED-3DMC:
A new tool to generate 3D conformation ensembles of small molecules
with a Monte Carlo sampling of the conformational space. Eur J Med
Chem 2009, 44:1405-1409.
12. Miteva MA, Guyon F, Tufféry P: Frog2: Efficient 3D conformation
ensemble generator for small compounds. Nucleic Acids Res 2010, 38:
W622-W627.
13. Sauton N, Lagorce D, Villoutreix BO, Miteva MA: MS-DOCK: Accurate
multiple conformation generator and rigid docking protocol for multi-
step virtual ligand screening. BMC Bioinformatics 2008, 9:184.
14. Makino S, Kuntz ID: Automated flexible ligand docking method and its
application for database search. J Comput Chem 1997, 18:1812-1825.
15. Huang N, Shoichet BK, Irwin JJ: Benchmarking Sets for Molecular Docking.
J Med Chem 2006, 49:6789-6801.
16. Halgren TA: Merck molecular force field. I. Basis, form, scope,
parameterization, and performance of MMFF94. J Comp Chem 1996,
17:490-519.
17. Theobald DL: Rapid calculation of RMSDs using a quaternion-based
characteristic polynomial. Acta Cryst A 2005, 61:478-480.
18. Kabsch W: A solution for the best rotation to relate two sets of vectors.
Acta Cryst A 1976, 32:922-923.
19. Hutchison GR, Morley C, Vandermeersch T, O’Boyle NM, James C, et al:
Open Babel, v2.3. [http://openbabel.org].
20. Free Software Foundation: GNU General Public License, v2. [http://www.
gnu.org/licenses/old-licenses/gpl-2.0.html].
21. Theobald DL: QCProt, v1.1. [http://theobald.brandeis.edu/qcp/].
22. Guennebaud G, Jacob B, et al: Eigen, v2.0.15. [http://eigen.tuxfamily.org].
23. Peeters K: tree.hh, v2.65. [http://tree.phi-sci.com/].
24. Linear feedback shift register. Wikipedia [http://en.wikipedia.org/wiki/
Linear_feedback_shift_register], Retrieved Aug 11, 2010.
25. Alfke P: Efficient Shift Registers, LFSR Counters, and Long Pseudo-
Random Sequence Generators. Xilinx application note 1996 [http://www.
xilinx.com/support/documentation/application_notes/xapp052.pdf],
XAPP052.
26. Borodina YV, Bolton E, Fontaine F, Bryant SH: Assessment of
Conformational Ensemble Sizes Necessary for Specific Resolutions of
Coverage of Conformational Space. J Chem Inf Model 2007, 47:1428-1437.
27. Smellie A, Kahn SD, Teig SL: Analysis of Conformational Coverage. 1.
Validation and Estimation of Coverage. J Chem Inf Comput Sci 1995,
35(2):285-294.
28. Weiner SJ, Kollman PA, Case DA, Singh UC, Ghio C, Alagona G, Profeta S Jr,
Weiner P: A new force field for molecular mechanical simulation of
nucleic acids and proteins. J Am Chem Soc 1984, 106:765-784.
29. Weiner SJ, Kollman PA, Nguyen DT, Case DA: An all atom force field for
simulations of proteins and nucleic acids. J Comput Chem 1986,
7:230-252.
30. O’Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the
OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
31. Beusen DD, Shands EFB, Karasek SF, Marshall GR, Dammkoehler RA:
Systematic search in conformational analysis. J Mol Struct THEOCHEM
1996, 370:157-171.

doi:10.1186/1758-2946-3-8
Cite this article as: O’Boyle et al.: Confab - Systematic generation of
diverse low-energy conformers. Journal of Cheminformatics 2011 3:8.


O’Boyle Journal of Cheminformatics 2011, 3:10

BOOK REPORT Open Access

Review of “Data Analysis with Open Source
Tools” by Philipp K Janert
Noel M O’Boyle

Correspondence: baoilleach@gmail.com
Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland

© 2011 O’Boyle; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.

J. Cheminf. 2011, 3, 10.

Book details
Janert PK: Data Analysis with Open Source Tools. Sebastopol, CA: O’Reilly Media; 2010.

Cheminformatics has been defined as the application of informatics methods to solve chemical problems [1]. Such chemical problems are often represented in terms of data, be it activity data for a series of compounds or descriptor values for a compound library. While this new book from the O’Reilly stable is not aimed specifically at cheminformaticians, the subtitle of “A Hands-On Guide for Programmers and Data Scientists” makes it clear that the target audience includes any scientists whose day-to-day work involves analysing and interpreting data.

The book is broadly divided into four parts on Graphics: Looking at Data, Analytics: Modeling Data, Computation: Mining Data and Applications: Using Data. First of all, it should be noted that this is not a book about statistics (as Chapter 1 states explicitly). Neither is it a manual for numpy, Sage, matplotlib, Gnuplot, R and so forth, as might be implied by the title. Instead, Janert focuses on discussing data analysis methods and techniques in depth, rather than skimming topics by following a cookbook or tutorial approach linked to particular software. This is as it should be - there are already documentation and manuals available for all of these programs, and the reader is simply alerted to the availability of the software, its capabilities are described and some examples of use shown.

This is a real practitioner’s book. Janert, a former physicist and software engineer, is a consultant in data analysis and mathematical modelling. He has taken his hard-won knowledge and tried to get it all down on paper for the reader’s benefit. For example, in a chapter with the provocative title of “What you really need to know about classical statistics” he explains why introductory statistics textbooks seem to cover methods and topics at odds with the problems data analysts deal with day-to-day; essentially classical methods were developed at a time of small and expensive datasets and no computational power, and hypothesis testing focused on determining whether an effect existed. Today we have ample computing power and may be dealing with very large datasets; also, we are usually more interested in the size of an effect (practical significance) rather than just whether it exists (statistical significance).

Topics that could not be squeezed into a chapter proper have been placed in shorter “Intermezzos” at the end of each section. For example, a short section on “What about map/reduce?” at the end of “Mining Data” reminds the reader that the map/reduce methodology (much hyped recently) is not a clever algorithm to speed things up, but rather a piece of infrastructure that makes it convenient to implement algorithms that are trivially parallelisable.

On the negative side, any cheminformatician who has been involved with QSAR studies will already be familiar with the multivariate analysis methods discussed here (Chapters 13 and 14), although I liked the observation that “you will actually spend more time on data sets that are totally worthless” in relation to clustering algorithms. Also there are two chapters (out of 19) which will be of little interest as they focus on business intelligence and financial calculations, although even there the reader will find an introduction to the use of Berkeley DB and SQLite from Python, tools which I highly recommend. There are also cases where the author perhaps gives too much detail, but this is hardly a criticism - in a book of some 500 pages there is plenty of room.

Overall though, I heartily recommend this book to anyone working in cheminformatics whether they develop methods or apply them. Too often we rely on summary statistics such as mean and standard deviation and forget to actually look at the data. Graphical analysis gives you a feel for the data, and can often highlight problems, interesting features, or mistaken assumptions. After reading this book, you should be very aware of both the advantages and pitfalls of a wide variety of analysis methods but you will also be reminded that the goal of data analysis is not a picture or a number but insight.
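
The point about summary statistics can be made concrete with a minimal Python sketch. The two datasets below are invented purely for illustration (they do not come from the book), and the helper name text_histogram is likewise hypothetical. Both datasets share the same mean and population standard deviation, yet even a crude text histogram shows that they have quite different shapes.

```python
from collections import Counter
from statistics import mean, pstdev

# Two made-up datasets (illustrative only, not taken from the book).
bimodal = [2, 2, 2, 2, 8, 8, 8, 8]
peaked = [-1, 5, 5, 5, 5, 5, 5, 11]


def text_histogram(values):
    """Return one line per distinct value, with one '#' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} {'#' * counts[value]}" for value in sorted(counts)]


if __name__ == "__main__":
    for name, data in (("bimodal", bimodal), ("peaked", peaked)):
        # Both datasets report the same mean and population standard deviation...
        print(f"{name}: mean={mean(data)}, sd={pstdev(data)}")
        # ...but the histogram shows how differently the values are distributed.
        print("\n".join(text_histogram(data)))
```

Running the script prints matching summary statistics for the two datasets, followed by histograms in which one dataset splits into two clusters and the other is a single peak with two outliers.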

Competing interests
The author declares that they have no competing interests.

Received: 8 March 2011 Accepted: 24 March 2011
Published: 24 March 2011

Reference
1. Gasteiger J: Introduction. In Chemoinformatics - A Textbook. Edited by:
Gasteiger J, Engel T. Weinheim: Wiley-VCH; 2003:1-13.

doi:10.1186/1758-2946-3-10
Cite this article as: O’Boyle: Review of “Data Analysis with Open Source
Tools” by Philipp K Janert. Journal of Cheminformatics 2011 3:10.



RESEARCH ARTICLE Open Access

Open Data, Open Source and Open Standards in
chemistry: The Blue Obelisk five years on
Noel M O’Boyle1*, Rajarshi Guha2, Egon L Willighagen3, Samuel E Adams4, Jonathan Alvarsson5,
Jean-Claude Bradley6, Igor V Filippov7, Robert M Hanson8, Marcus D Hanwell9, Geoffrey R Hutchison10,
Craig A James11, Nina Jeliazkova12, Andrew SID Lang13, Karol M Langner14, David C Lonie15, Daniel M Lowe4,
Jérôme Pansanel16, Dmitry Pavlov17, Ola Spjuth5, Christoph Steinbeck18, Adam L Tenderholt19, Kevin J Theisen20
and Peter Murray-Rust4

Abstract
Background: The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open
Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by
promoting interoperability between chemistry software, encouraging cooperation between Open Source
developers, and developing community resources and Open Standards.
Results: This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys
progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.
Conclusions: We show that the Blue Obelisk has been very successful in bringing together researchers and
developers with common interests in ODOSOS, leading to development of many useful resources freely available
to the chemistry community.

Background
The Blue Obelisk movement was established in 2005 at the 229th National Meeting of the American Chemical Society as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. While other scientific disciplines such as physics, biology and astronomy (to name a few) were embracing new ways of doing science and reaping the benefits of community efforts, there was little if any innovation in the field of chemistry and scientific progress was actively hampered by the lack of access to data and tools. Since 2005 it has become evident that a good amount of development in open chemical information is driven by the demands of neighbouring scientific fields. In many areas in biology, for example, the importance of small molecules and their interactions and reactions in biological systems has been realised. In fact, one of the first free and open databases and ontologies of small molecules was created as a resource about chemical structure and nomenclature by biologists [1].

The formation of the Blue Obelisk group is somewhat unusual in that it is not a funded network, nor does it follow the industry consortium model. Rather it is a grassroots organisation, catalysed by an initial core of interested scientists, but with membership open to all who share one or more of the goals of the group:

• Open Data in Chemistry. One can obtain all scientific data in the public domain when wanted and reuse it for whatever purpose.
• Open Standards in Chemistry. One can find visible community mechanisms for protocols and communicating information. The mechanisms for creating and maintaining these standards cover a wide spectrum of human organisations, including various degrees of consent.
• Open Source in Chemistry. One can use other people’s code without further permission, including changing it for one’s own use and distributing it again.

* Correspondence: baoilleach@gmail.com
1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, College Road, Cork, Co. Cork, Ireland


J. Cheminf. 2011, 3, 37.


Note that while some may advocate also for Open Access to publications, the Blue Obelisk goals (ODOSOS) focus more on the availability of the underlying scientific data, standards (to exchange data), and code (to reproduce results). All three of these goals stem from the fundamental tenets of the scientific method for data sharing and reproducibility.

The Blue Obelisk was first described in the CDK News [2] and later as a formal paper by Guha et al. [3] in 2006. Its home on the web is at http://blueobelisk.org. This contribution looks back on the work carried out by the Blue Obelisk over the past 5 years in the areas of Open Data, Open Source, and Open Standards in chemistry.

Scope
The Blue Obelisk covers many areas of chemistry and chemical resources used by neighbouring disciplines (e.g. biochemistry, materials science). Many of the efforts relate to cheminformatics (the scope of this journal) and we believe that many of the publications in Journal of Cheminformatics could be completely carried out using Blue Obelisk resources and other Open Source chemical tools. The importance of this is that for the first time it would allow reviewers, editors and readers to validate assertions in the journal and also to re-run and re-analyse parts of the calculation.

However, Blue Obelisk software and data are also used outside cheminformatics and certainly in the five main areas that, for example, Chemical Markup Language (CML) [4] supports:

1. Molecules: This is probably the largest area for Blue Obelisk software and data, and is reflected by many programs that visualise, transform, convert formats and calculate properties. It is almost certain that any file format currently in use can be processed by Blue Obelisk software and that properties can be calculated for most (organic compounds).
2. Reactions: Blue Obelisk software can describe the semantics of reactions and provide atom-atom matching and analyse stoichiometric balance in reactions.
3. Computational chemistry: Blue Obelisk software can interpret many of the current output files from calculations and create input for jobs. The Quixote project (see below and elsewhere in this issue) shows that Open Source approaches based on Blue Obelisk resources and principles are increasing the availability and re-usability of computational chemistry.
4. Spectra: 1-D spectra (NMR, IR, UV etc.) are fully supported in Blue Obelisk offerings for conversion and display. There is a limited amount of spectral analysis but the software gives a platform on which it should be straightforward to develop spectral annotation and manipulation. However, currently the Blue Obelisk lacks support for multi-dimensional NMR and multi-equipment spectra (e.g. GC-MS).
5. Crystallography: The Blue Obelisk software supports the bi-directional processing of crystal structure files (CIF) and also solid-state calculations such as plane-waves with periodic boundary conditions. There is considerable support for the visualisation of both periodic and aperiodic condensed objects.

Many of the current operations in installing and running chemical computations and using the data are integration and customisation rather than fundamental algorithms. It is very difficult to create universal platforms that can be distributed and run by a wide range of different users, and in general, the Blue Obelisk deliberately does not address these. Our approach is to produce components that can be embedded in many environments, from stand-alone applications to web applications, databases and workflows. We believe that a chemical laboratory with reasonable access to common software engineering techniques should be able to build customised applications using Blue Obelisk components and standard infrastructure such as workflows and databases. Where the Blue Obelisk itself produces data resources they are normally done with Open components so that the community can, if necessary, replicate them.

Much of the impetus behind Blue Obelisk software is to create an environment for chemical computation (including cheminformatics) where all of the components, data, specifications, semantics, ontology and software are Openly visible and discussable. The largest current uses by the general chemical community are in authoring, visualisation and cheminformatics calculations but we anticipate that this will shortly extend into mainstream computational chemistry and solid-state. Although many of the authors are employed as research scientists, there are also several people who contribute in their spare time and we anticipate an increasing value and use of the Blue Obelisk in education at all levels.

Open Source
The development of Open Source software has been one of the most successful of the Blue Obelisk’s activities. The following sections describe recent work in this area, and Table 1 provides an overview of the projects discussed and where to find them online.

Cheminformatics toolkits
Open Source toolkits for cheminformatics have now existed for nearly ten years. During this period, some toolkits were developed from scratch in academia,



Table 1 Blue Obelisk Open Source Software projects discussed in the text

Name                                Website
CML Tools
  CMLXOM                            https://bitbucket.org/wwmm/cmlxom/
  JUMBO                             http://sourceforge.net/projects/cml/
Cheminformatics Toolkits
  Chemistry Development Kit (CDK)   http://cdk.sf.net
  Cinfony                           http://cinfony.googlecode.com
  Indigo                            http://ggasoftware.com/opensource/indigo
  JOELib                            http://sf.net/projects/joelib
  Open Babel                        http://openbabel.org
  RDKit                             http://rdkit.org
Web Applications
  ChemDoodle Web Components         http://web.chemdoodle.com
  Jmol                              http://jmol.org
Integration
  Bioclipse                         http://www.bioclipse.net
  CDK-Taverna                       http://cdktaverna.wordpress.com
  Lensfield2                        https://bitbucket.org/sea36/lensfield2/
Interconversion
  CIFXOM [95]                       https://bitbucket.org/wwmm/cifxom/
  JUMBO-Converters                  https://bitbucket.org/wwmm/jumbo-converters/
  OPSIN                             http://opsin.ch.cam.ac.uk
  OSRA                              http://osra.sf.net
Structure Databases
  Bingo                             http://ggasoftware.com/opensource/bingo
  Chempound (Chem#)                 https://bitbucket.org/chempound
  Mychem                            http://mychem.sf.net
  OrChem                            http://orchem.sf.net
  pgchem                            http://pgfoundry.org/projects/pgchem/
Text mining
  ChemicalTagger [96]               http://chemicaltagger.ch.cam.ac.uk/
  OSCAR4                            https://bitbucket.org/wwmm/oscar4/
Computational Chemistry
  Avogadro                          http://avogadro.openmolecules.net
  cclib                             http://cclib.sf.net
  GaussSum                          http://gausssum.sf.net
  QMForge                           http://qmforge.sf.net
Computational Drug Design
  Confab [97]                       http://confab.googlecode.com
  Pharao                            http://silicos.be/download
  Piramid                           http://silicos.be/download
  Sieve                             http://silicos.be/download
  Stripper                          http://silicos.be/download
Other Applications
  AMBIT2                            http://ambit.sf.net
  Brunn                             http://brunn.sf.net
  Toxtree                           http://toxtree.sf.net
  XtalOpt                           http://xtalopt.openmolecules.net



whereas others were made Open Source by releasing in-house codebases under liberal licenses. When the Blue Obelisk was established five years ago, the primary toolkits under active development were the Chemistry Development Kit (CDK) [5,6], Open Babel [7], and JOELib [8]. Of these, both the CDK and Open Babel continue to be actively developed.

The CDK project has been under regular development over the last five years. Several features have been implemented ranging from core components such as an extensible SMARTS matching system and a new graph (and subgraph) isomorphism method [9], to more application oriented components such as 3D pharmacophore searching and matching, and a variety of structural-key and hashed fingerprints. In addition, there have been a number of second generation tools developed on top of the CDK (see below). As well as the use of the CDK in various tools, it has been deployed in the form of web services [10] and has formed the basis of a variety of web applications.

Since 2006, major new features of Open Babel include 3D structure generation and 2D structure-diagram generation, UFF and MMFF94 forcefields, and significantly expanded support for computational chemistry calculations. In addition, a major focus of Open Babel development has been to provide for accurate conversion and representation in areas of stereochemistry, kekulisation, and canonicalisation. The project has also grown, in terms of new contributors, new support from commercial companies, and second-generation tools applying Open Babel to a variety of end-user applications, from molecular editors to chemical database systems.

Two new Open Source cheminformatics toolkits have appeared since the original paper. In 2006 Rational Discovery, a cheminformatics service company (since closed down), released RDKit [11] under the BSD License. This is a C++ library with Python and (more recently) Java bindings. RDKit is actively developed and includes code donated by Novartis. Recent developments include the Java bindings, as well as performance improvements for its database cartridge.

More recently, GGA Software Services (a contract programming company) released the Indigo toolkit [12] and associated software in 2009 under the GPL. Indigo is a C++ library with high-level wrappers in C, Java, Python, and the .NET environment. Like RDKit and other toolkits, Indigo provides support for tetrahedral and cis-trans stereochemistry, 2D coordinate generation, exact/substructure/SMARTS matching, fingerprint generation, and canonical SMILES computation. It also provides some less common functionality, like matching tautomers and resonance substructures, enumeration of subgraphs, finding maximum common substructure of N input structures, and enumerating reaction products.

Second-generation tools
Although feature-rich and robust cheminformatics toolkits are useful in and of themselves, they can also be seen as providing a base layer on which additional tools and applications can be built. This is one of the reasons that cheminformatics toolkits are so important to the open source ‘ecosystem’; their availability lowers the barrier for the development of a ‘second generation’ of chemistry software that no longer needs to concern itself with the low-level details of manipulating chemical structures, and can focus on providing additional functionality and ease-of-use. Although a wide range of chemistry software has been built using Blue Obelisk components (see for example, the “Related Software” link on the Open Babel website, [13] listing over 40 projects as of this writing, or “Software using CDK” at the CDK website), in this section we focus on second-generation tools which themselves have been developed by members of the Blue Obelisk.

Bioclipse [14] (v2.4 released in Aug 2010) and Avogadro [15] (v1.0 in Oct 2009) are two examples of such software, based on the CDK and Open Babel, respectively. Bioclipse (Figure 1) is an award-winning molecular workbench for life sciences that wraps cheminformatics functionality behind user-friendly interfaces and graphical editors while Avogadro (Figure 2) is a 3D molecular editor and viewer aimed at preparing and analysing computational chemistry calculations. Both projects are designed to be extended or scripted by users through the provision of a plugin architecture and scripting support (using Bioclipse Scripting Language [16], or Python in the case of Avogadro). An interesting aspect of both Avogadro and Bioclipse is that they share some developers with the underlying toolkits and this has driven the development of new features in the CDK and Open Babel.

Both products in turn act as extensible platforms for other software. Bioclipse, for example is used by software such as Brunn [17], a laboratory information system for microplate based high-throughput screening. Brunn provides a graphical interface for handling different plate layouts and dilution series and can automatically generate dose response curves and calculate IC50-values. Avogadro is used by Kalzium [18], a periodic table and chemical editor in KDE, and XtalOpt [19,20], an evolutionary algorithm for crystal structure prediction. XtalOpt provides a graphical interface using Avogadro and submits calculations using a range of solid-state simulation software to predict stable polymorphs.

A final example of second-generation Blue Obelisk software is the AMBIT2 [21,22] software, which was developed to facilitate registration of chemicals for the REACH EU directive, and is based on the CDK. It was distributed initially as a standalone Java Swing GUI, and



Figure 1 Screenshot of Bioclipse using Jmol to visualise a molecular surface.

Figure 2 Screenshot of Avogadro showing a depiction of a carbon nanotube.

more recently as a downloadable web application archive, offering a web services interface to a searchable chemical structures database. Also integrated are descriptor calculations, as well as the ability to run and build predictive models, including modules of the open source Toxtree [22-24] software for toxicity prediction.

Computational chemistry analysis
Another area where the Blue Obelisk has had a significant impact in the past five years is in supporting quantum chemistry calculations and in interpreting their results. Electronic structure calculations have a long tradition in the chemistry community and a variety of programs exist, mostly proprietary software but with an increasing number of open source codes. However, since each program uses different input formats, and the output formats vary widely (sometimes even varying between different versions of the same software), preparing calculations and automatically extracting the results is problematic.

Avogadro has already been mentioned as a GUI for preparing calculations. It uses Open Babel to read the output of several electronic structure packages. Avogadro generates input files on the fly in response to user input on forms, as well as allowing inline editing of the files before they are saved to disk. It also features intuitive syntax highlighting for GAMESS input files,



allowing expert users to easily spot mistakes before saving an input file to disk.

In addition to this, significant development of new parsing routines took place in an Avogadro plugin to read in basis sets and electronic structure output in order to calculate molecular orbital and electron density grids. This code was written to be parallel, using desktop shared memory parallelism and high level APIs in order to significantly speed up analysis. Most of this code was recently separated from the plugin, and released as a BSD licensed library, OpenQube, which is now used by the latest version of Avogadro. Jmol (see below) can also depict computational chemistry results including molecular orbitals.

In 2006, the Blue Obelisk project cclib [25] was established with the goal of parsing the output from computational chemistry programs and presenting it in a standard way so that further analyses could be carried out independently of the quantum package used. cclib is a Python library, and the current version (version 1.0.1) supports 8 different computational chemistry codes and extracts over 30 different calculated attributes. Two related Blue Obelisk projects build upon cclib. GaussSum [26] is a GUI that can monitor the progress of SCF and geometry convergences, and can plot predicted UV/Vis absorption and infrared spectra from appropriate logfiles containing energies and oscillator strengths for easy comparison to experimental data. QMForge [27] provides a GUI for various electronic structure analyses such as Frenking’s charge decomposition analysis [28] and Mulliken or C-squared analyses on user-defined molecular fragments. QMForge also provides a rudimentary Cartesian coordinate editor allowing molecular structures to be saved via Open Babel.

The Quixote project epitomises the full use of the Blue Obelisk software and is described in detail in another article in this issue. Here we observe that it is possible to convert legacy chemistry file formats of all sorts into semantic chemistry and extract those parts which are suitable for input to computational chemistry programs. This chemistry is then combined with generic concepts of computational chemistry (e.g. strategy, machine resources, timing, accuracy etc.) into the legacy inputs for a wide range of programs. Quixote itself follows Blue Obelisk principles in that it does not manage the submission and monitoring of jobs but resumes action when the jobs have been completed, and then applies a range of parsing and transformation tools to create standardised semantic chemical content. A major feature of Quixote is that it requires all concepts to validate against dictionaries and the process of parsing files necessarily generates communally-agreed dictionaries, which represent an important step forward in the Open specifications for Blue Obelisk. When widely-deployed, Quixote will advertise the value of Open community standards for semantics to the world.

The Quixote project is not dependent on any particular technology, other than the representation of computational chemistry in CML and the management of semantics through CML dictionaries. At present, we use JUMBO-Converters [29] for most of the semantic conversion, Lensfield2 [30] for the workflow and Chempound (chem#) [31] to store and disseminate the results.

Web applications
While desktop software has composed the majority of scientific tools since the computer was introduced, the internet continues to change how applications and content are distributed and presented. The web presents new opportunities for scientists as it is an open and free medium to distribute scientific knowledge, ideas and education. Web applications are software that runs within the browser, typically implemented in Java or JavaScript. Recently, a new version of the HTML specification, HTML5, defined a well-developed framework for creating native web applications in JavaScript and this opens up new possibilities for visualising chemical data.

Jmol, the interactive 3D molecular viewer, is one of the most widely used chemistry applets, and indeed has seen widespread use in other fields such as biology and even mathematics (it is used for 3D depiction of mathematical functions in the Sage Mathematics Projects [32]). It is implemented in Java, and has gone from being a “Rasmol/Chime” replacement to a fully fledged molecular visualisation package, including full support for crystallography [33], display of molecular orbitals from standard basis set/coefficient data, the inclusion of dynamic minimisation using the UFF force field, and a full implementation of Daylight SMILES and SMARTS, with extensions to conformational and biomolecular substructure searching (Jmol BioSMARTS).

In 2009, iChemLabs released the ChemDoodle Web Components library [34] under the GPL v3 license (with a liberal HTML exception). This library is completely implemented in JavaScript and uses HTML5 to allow the scientist to present publication quality 2D and 3D graphics (see Figure 3) and animations for chemical structures, reactions and spectra. Beyond graphics, this tool provides a framework for user interaction to create dynamic applications through web browsers, desktop platforms and mobile devices such as the iPhone, iPad and Android devices.

The business end
Open Source provides a unique opportunity for commercial organisations to work with the cheminformatics community. Traditional business models rely on monetisation of source code, causing companies to repeat work

J. Cheminf. 2011, 3, 37.


done by other companies. This model is sometimes combined with a free (gratis) model for people
working at academic institutes, to increase adoption and encourage contributions from academics.
This solution defines the return on investment as the IP on the software, but has the downside
of investment losses due to duplication of software and method development, which become visible
when proprietary companies merge.

Figure 3 Screenshot of the MolGrabber 3D demo from ChemDoodle Web Components.

Some authors have argued that in the chemistry field few contributors are available to volunteer
time to improve codes and IP considerations may prevent contributions from industry [35]. If
true, this would hamper adoption of Open Source and Open Data in chemistry, and greatly slow the
growth of projects such as those in the Blue Obelisk.
The Blue Obelisk community, however, takes advantage of the fact that much of the investment
needed for development is either paid for by academic institutes and funding schemes, or by
volunteers investing time and effort. In return, contributors get full access to the source
code, and the Open Source licensing ensures that they will have access any time in the future.
In this way, the license functions as a social contract between everyone to arrange an immediate
return on investment. Effectively, this approach shares the burden of the high investment in
having to develop cheminformatics software from scratch, allowing researchers and commercial
partners alike to focus on their core business, rather than the development of prerequisites. In
the case of the Blue Obelisk, the rich collection of Open Source cheminformatics tools provided
greatly reduces investment up front for new companies in the cheminformatics market. Such
advantages have also been noted in the drug discovery field [36-38].
The use of Open Standards allows everyone to select those Blue Obelisk components they find most
useful, as they can easily replace one component with another providing the same functionality,
taking advantage of the fact that they use the same standards for, for example, data
exchange. This way, licensing issues are becoming a marginal problem, allowing companies to
select a license appropriate for their business model. This, too, allows a company to create a
successful product with significantly reduced cost and effort.
At the time of writing there are many commercial companies developing chemistry solutions around
Open Source cheminformatics components provided by the Blue Obelisk community. Examples of such
companies include iChemLabs, IdeaConsult, Wingu, Silicos, GenettaSoft, eMolecules, hBar,
Metamolecular, and Inkspot Science. Some of these merely use components, but several actively
contribute back to the Blue Obelisk project they use, or donate new Open Source cheminformatics
projects to the community.
For example, iChemLabs released the ChemDoodle Web Components library under the GPL v3 license,
based on the upcoming HTML5 Open Standard. It allows making web and mobile interfaces for
chemical content. The project is already being adopted by others, including iBabel [39],
ChemSpotlight [40] and the RSC ChemSpider [41,42].
Silicos has released several Open Source utilities [43] based on Open Babel, such as Pharao, a
tool for pharmacophore searching, Sieve for filtering molecular structure by molecular property,
Stripper for removing core scaffold structures from a molecule set, and Piramid for molecular
alignment using shape determined by the Gaussian volumes as a descriptor. Additionally,
contributions have been made to the Open Babel project itself.
Other companies use Blue Obelisk components and contribute patches, smaller and larger. For
example, IXELIS donated the isomorphism code in the CDK, eMolecules donated canonicalisation code
to Open Babel, Metamolecular improved the extensibility and unit testing suite of OPSIN, and
AstraZeneca contributed code to the CDK for signatures. This is just a very minor selection, and
the reader is encouraged to contact the individual Blue Obelisk projects for a detailed list.
In May 2011, a Wellcome Trust Workshop on Molecular Informatics Open Source Software (MIOSS)
explored the role of Open Source in industrial laboratories and companies as well as academia
(several of the presenters are among the authors of this paper). The meeting identified that
Open Source software was extremely valuable to industry not just because it is available for
free, but because it allows the validation of source code, data and computational procedures.
Some of the discussion was on business models or other ways to maintain development of Open
Source software on which a business relied. Companies are concerned about training and support
and, in some cases, product liability. There are difficulties for software for which there is



no formal transaction other than downloading and agreeing to license terms. One anecdote
concerned a company that wished to donate money to an Open Source project but could not find a
mechanism to do so. Industry participants also pointed out that there is a considerable amount
of contribution-in-kind from industry, both from enhancements to software and also the
development of completely new software and toolkits. Companies are now finding it easier to
create mechanisms for releasing Open Source software without violating confidentiality or
incurring liability. A phrase from the meeting summed it up: “The ice is beginning to melt”,
signifying that we can expect a rapid increase in industry’s interest in Open Source.

Converting chemical names and images to structures
The majority of chemical information is not stored in machine-readable formats, but rather as
chemical names or depictions. The OSRA and OPSIN projects focus on extracting chemical
information from these sources. Such software plays a particularly important role for data
mining the chemical literature, including patents and theses.
Optical Structure Recognition Application (OSRA) [44] was started in early 2007 with the goal to
create the first free and open source tool for extraction and conversion of molecular images
into SMILES and SD files. From the very beginning the underlying philosophy was to integrate
existing open source libraries and to avoid “reinventing the wheel” wherever possible. OSRA
relies on a variety of open source components: Open Babel for chemical format conversion and
molecular property calculations, GraphicsMagick for image manipulation, Potrace for
vectorisation, GOCR and OCRAD for optical character recognition. The growing importance of image
recognition technology can be seen in the fact that only a few years ago there was only one
widely available software package for chemical structure recognition - CLiDE (commercially
developed at Keymodule, Ltd), but today there are as many as seven available programs.
OPSIN (Open Parser for Systematic IUPAC Nomenclature) [45] focuses instead on interpreting
chemical names. The chemical name is the oldest form of communication used to describe
chemicals, predating even the knowledge of the atomic structure of compounds. Chemical names are
abundant in the scientific literature and encode valuable structural information. Through
successive books of recommendations [46,47], IUPAC has tried to codify and to an extent
standardise naming practices. OPSIN aims to make this abundance of chemical names machine
readable by translating them to SMILES, CML or InChI. The program is based around the use of a
regular grammar to guide tokenisation and
parsing of chemical names, followed by step-wise application of nomenclature rules. It is able
to offer fast and precise conversions for the majority of names using IUPAC organic
nomenclature, and is available as a web service, Java library and standalone application for
maximum interoperability.

Chemical database software
Registration, indexing and searching of chemical structures in relational databases is one of
the core areas of cheminformatics. A number of structure registration systems have been
published in the last five years, exploiting the fact that Open Source cheminformatics toolkits
such as Open Babel and the CDK are available. OrChem [48], for example, is an open source
extension for the Oracle 11G database that adds registration and indexing of chemical structures
to support fast substructure and similarity searching. The cheminformatics functionality is
provided by the CDK. OrChem provides similarity searching with response times in the order of
seconds for databases with millions of compounds, depending on a given similarity cut-off. For
substructure searching, it can make use of multiple processor cores on today’s powerful database
servers to provide fast response times in equally large data sets.
Besides the traditional and proven relational database approach with added chemical features
(‘cartridges’), there is growing interest in tools and approaches based on the web philosophy
and practice. Several groups [49,50] are experimenting with the Resource Description Framework
(RDF) language on the assumption that generic high-performance solutions will appear. RDF allows
everything to be described by URIs (data, molecules, dictionaries, relations). The Chempound
system [31], as deployed in Quixote and elsewhere, is an RDF-based approach to chemical
structures and compounds and their properties. For small to medium-sized collections (such as an
individual’s calculations or literature retrieval), there are many RDF tools (e.g. SIMILE,
Apache Jena) which can operate in machine memory and provide the flexibility that RDF offers.
For larger systems, it is unclear whether complete RDF solutions (e.g. Virtuoso) will be
satisfactory or whether a hybrid system based on name-value pairs (e.g. CouchDB, MongoDB) will
be sufficient.

Collaboration and interoperability
One of the successes of the Blue Obelisk has been to bring developers together from different
Open Source chemistry projects so that they look for opportunities to collaborate rather than
compete, and to leverage work done by other projects to avoid duplication of effort. As an
example of this, when in March 2008 the Jmol development team were looking to add support for
energy



minimisation, rather than implement a forcefield from scratch they ported the UFF forcefield
[51] implementation from Open Babel to Jmol. This code enables Jmol to support 2D to 3D
conversion of structures (through energy minimisation). In a similar manner, efficient Jmol code
for atom-atom rebonding has been ported to the CDK. Figure 4 shows the collaborative nature of
software developed in the Blue Obelisk, as one project builds on functionality provided by
another project.

Figure 4 Dependency diagram of some Blue Obelisk projects. Each block represents a project.
Square blocks show Open Data, ovals are Open Source, and diamonds are Open Standards.

Another collaborative initiative between Blue Obelisk projects was the establishment in May 2008
of the ChemiSQL project. This brought together the developers of several open source chemistry
database cartridges (PgChem [52], Mychem [53], OrChem [48] and more recently Bingo [54]) with a
view to making their database APIs more similar and collaborating on benchmark datasets for
assessing performance. For two of these projects, PgChem and Mychem, which are both based on
Open Babel, there is the additional possibility of working together on a shared codebase.
In the area of cheminformatics toolkits, two of the existing toolkits, Open Babel and RDKit, are
planning to work together on a common underlying framework called MolCore [55]. This project is
still in the planning stage, but if it is a success it will mean that the two libraries will be
interoperable (while retaining their existing focus) but also that the cost of maintaining the
code will be shared among more developers, freeing time for the development of new features.
One of the goals of the Blue Obelisk is to promote interoperability in chemical informatics.
When barriers exist to moving chemical data between different software, the community becomes
fragmented and there is
the danger of vendor lock-in (where users are constrained to using a particular software, a
situation which puts them at a disadvantage). This applies as much to Open Source software as to
proprietary software. Cinfony is a project (first release in May 2008) whose goal is to tackle
this problem in the area of cheminformatics toolkits [56]. It is a Python library that enables
Open Babel, the CDK, and RDKit (and shortly, Indigo and OPSIN) to be used through the same API;
this makes it easy, for example, to read a molecule using Open Babel, calculate descriptors
using the CDK and create a depiction using RDKit.
Another way through which interoperability of Blue Obelisk projects has been promoted and
developed is through integration into workflow software such as Taverna [57] and KNIME [58]
(both open source). Such software makes it easy to automate recurring tasks, and to combine
analyses or data from a variety of different software and web services. A combination of the
Chemistry Development Kit and Taverna, for instance, was reported in 2010 [59]. In the case of
KNIME, it comes with a built-in basic collection of CDK-based and Open Babel-based nodes, while
other nodes for the RDKit and Indigo are available from KNIME’s “Community Updates” site.

Open Standards
Chemical Markup Language, CML
Chemical Markup Language (CML) is discussed in several articles in this issue, and a brief
summary here reiterates that it is designed primarily to create a validatable semantic
representation for chemical objects. The five main areas (molecules, reactions, computational



chemistry, spectra and solid-state (see above)) have now all been extensively deployed and
tested. CML can therefore be used as a reference for input and output for Blue Obelisk software
and a means of representing data in Blue Obelisk resources.
CML, being an XML application, can inter-operate with other markup languages and in particular
XHTML, SVG, MathML, docx and more specialised applications such as UnitsML and GML
(geosciences). We believe that it would be possible using these languages to encode large parts
of, say, first year chemistry text books in XML. Similarly, it is possible to create compound
documents with word processing or spreadsheet software that have inter-operating text, graphics
and chemistry (as in Chem4Word). Being a markup language, CML is designed for re-purposing,
including styling, and therefore a mixture of these languages can be used for chemical
catalogues, general publications, logbooks and many other types of document in the scientific
process.
CML describes much of its semantics through conventions and dictionaries, and the emerging
ecosystem (especially in computational chemistry) is available as a semantic resource for many
of the applications and specifications in this article.

InChI
The IUPAC InChI identifier is a non-proprietary and unique identifier for chemical substances
designed to enable linking of diverse data compilations. Prior to the development of the InChI
identifier, chemical information systems and databases used a wide variety of (generally
proprietary) identifiers, greatly limiting their interoperability. Although its development
predates the Blue Obelisk, software such as Open Babel has included InChI support since 2005,
and support for InChI in Indigo is due in 2011.
Since the official InChI implementation is in C, it is difficult to access from the other widely
used language for cheminformatics toolkits, Java. Early attempts to generate InChI identifiers
from within Java involved programmatically launching the InChI executable and capturing the
output, an approach that was found to be fairly unreliable and broke the ‘write once, run
anywhere’ philosophy of Java. The Blue Obelisk project JNI-InChI [60] was established in 2006 to
solve this problem by using the Java Native Interface framework to provide transparent access to
the InChI library from within Java and other Java Virtual Machine (JVM) based languages,
supporting the wider adoption of this standard identifier by the chemistry community.
The Java Native Interface framework provides a mechanism for code running inside the JVM to
place calls to libraries written in languages such as C, C++ and Fortran, and compiled into
native, machine-specific code. JNI-InChI provides a thin C wrapper, with corresponding Java
code, around the IUPAC InChI library, exposing the InChI library’s functionality to the JVM. To
overcome the need to have the correct InChI library pre-installed on a system, JNI-InChI comes
with a variety of precompiled native binaries and automatically extracts and deploys the correct
one for the detected operating system and architecture. The JNI-InChI library comes with native
binaries supporting a range of operating systems and architectures; the current version has
binaries for 32- and 64-bit Windows, Linux and Solaris, 64-bit FreeBSD and 64-bit Intel-based
Mac OS X - a number of which are not supported by the original IUPAC distribution of InChI. The
JNI-InChI project has matured to support the full range of functionality of the InChI C library:
structure-to-InChI, InChI-to-structure, AuxInfo-to-structure, InChIKey generation, and InChI and
InChIKey validation. JNI-InChI provides the InChI functionality for a number of Open Source
projects, including the Chemistry Development Kit, Bioclipse and CMLXOM/JUMBO, and is also used
by commercial applications and internally in a number of companies. Through its widespread use
and Open Source development model, a number of issues in earlier versions of the software have
been identified and resolved, and JNI-InChI now offers a robust tool for working with InChIs in
the JVM.

OpenSMILES
One of the most widely used ways to store chemical structures is the SMILES format (or SMILES
string). This is a linear notation developed by Daylight Information Systems that describes the
connection table of a molecule and may optionally encode chirality. Its popularity stems from
the fact that it is a compact representation of the chemical structure that is human readable
and writable, and is convenient to manipulate (e.g. to include in spreadsheets, or copy from a
web page). Despite its widespread use, a formal definition of the language did not exist beyond
Daylight’s SMILES Theory Manual and tutorials. This caused some confusion in the implementation
and interpretation of corner cases, for example the handling of cis/trans bond symbols at ring
closures. In 2007, Craig James (eMolecules) initiated work on the OpenSMILES specification [61],
a complete specification of the SMILES language as an Open Standard developed through a
community process. The specification is largely complete and contains guidelines on reading
SMILES, a formal grammar, recommendations on standard forms when writing SMILES, as well as
proposed extensions.
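
To make the idea of a formal grammar for SMILES more concrete, the following is a minimal sketch of a tokeniser for a small subset of the notation. It is an illustration only and is not taken from the OpenSMILES specification: the token names (bracket_atom, organic_atom and so on) and the function name tokenize_smiles are assumptions introduced here, the wildcard atom and many other features are left out, and no attempt is made to check that ring closures or branches are balanced.

```python
import re

# Simplified token classes for a subset of SMILES. The group names are
# hypothetical labels chosen for this sketch, not terms from the
# OpenSMILES grammar.
TOKEN_PATTERN = re.compile(
    r"(?P<bracket_atom>\[[^\]]+\])"        # e.g. [C@@H], [NH4+]
    r"|(?P<organic_atom>Cl|Br|[BCNOSPFI])"  # two-letter symbols tried first
    r"|(?P<aromatic_atom>[bcnosp])"         # lowercase aromatic atoms
    r"|(?P<bond>[-=#$:/\\])"                # includes the cis/trans symbols / and \
    r"|(?P<ring_closure>%\d{2}|\d)"         # single digit or %nn
    r"|(?P<branch>[()])"
    r"|(?P<dot>\.)"                         # disconnected components
)


def tokenize_smiles(smiles):
    """Split a SMILES string into (token_type, text) pairs.

    Raises ValueError at the first character that fits none of the
    token classes above.
    """
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_PATTERN.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"Unexpected character {smiles[pos]!r} at position {pos}"
            )
        # lastgroup is the name of the alternative that matched
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling tokenize_smiles("F/C=C/F") returns the atoms and bonds in order, with the two "/" symbols reported as bond tokens. A real parser would go on to decide how such symbols are interpreted, which is the kind of corner case a community specification aims to settle.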



QSAR-ML
The field of QSAR has long been hampered by the lack of open standards, which makes it difficult
to share and reproduce descriptor calculations and analyses. QSAR-ML was recently proposed as an
open standard for exchanging QSAR datasets [62]. A dataset in QSAR-ML includes the chemical
structures (preferably described in CML) with InChI to protect integrity, chemical descriptors
linked to the Blue Obelisk Descriptor Ontology [63], response values, units, and versioned
descriptor implementations to allow descriptors from different software to be integrated into
the same calculation. Hence, a dataset described in QSAR-ML is completely reproducible. To allow
for easy setup of QSAR-ML compliant datasets, a plugin for Bioclipse was created with a
graphical interface for setting up QSAR datasets and performing calculations. Descriptor
implementations are available from the CDK and JOELib, as well as via remote web services such
as XMPP [64].

Remaining challenges
A core requirement for chemical structure databases and chemical registration systems in general
is the notion of structure standardisation. That is, for a given input structure, multiple
representations should be converted to one canonical form. Structure canonicalisation routines
partially address this aspect, converting multiple alternative topologies to a single canonical
form. However, the problem of standardisation is broader than just topological canonicalisation.
Features that must be considered include

• topological canonicalisation
• handling of charges
• tautomer enumeration and canonicalisation
• normalisation of functional groups

Currently, most of the individual components of a ‘standardisation pipeline’ can be implemented
using Blue Obelisk tools. The larger problem is that there is no agreed upon list of steps for a
standardisation process. While some specifications have been published (e.g., PubChem) and some
standardisation services and tools are available (for example, PubChem provides an online
service to standardise molecules [65]) each group
has their own set of rules. A common reference specification for standardisation would be of
immense value in interoperability between structure repositories as well as between toolkits
(though the latter is still confounded by differences in lower level cheminformatic features
such as aromaticity models).
We have already discussed the development of an Open SMILES standard. While much progress has
been made towards a complete specification, more remains to be done before this can be
considered finished. After that point, the next logical step would be to start work on a
standard for the SMARTS language, the extension to SMILES that specifies patterns that match
chemical substructures.

Open Data
A considerable stumbling block in advocating the release of scientific data as Open Data has
been how exactly to define “Open.” A major step forward was the launch in 2010 of the Panton
Principles for Open Data in Science [66]. This formalises the idea that Open Data maximises the
possibility of reuse and repurposing, the fundamental basis of how science works. These
principles recommend that published data be licensed explicitly, and preferably under CC0
(Creative Commons ‘No Rights Reserved’, also known as CCZero) [67]. This license allows others
to use the data for any purpose whatsoever without any barriers. Other licenses compatible with
the Panton Principles include the Open Data Commons Public Domain Dedication and Licence (PDDL),
the Open Data Commons Attribution License, and the Open Data Commons Open Database License
(ODbL) [68].

Table 2 Open Data in chemistry.
Name               License/Waiver  Description
Chempedia [98]     CC0             Crowd-sourced chemical names (project discontinued but data still available)
CrystalEye         PDDL            Crystal structures from primary literature
ONS Solubility     CC0             Solubility data for various solvents
Reaction Attempts  CC0             Data on successful and unsuccessful reactions
Overview of major open chemical data available under a license or waiver compatible with the Panton Principles.

Despite this positive news, little chemical data compatible with these principles has become
available from the traditional chemistry fields of organic, inorganic, and solid state
chemistry. Table 2 lists a few notable exceptions, some of which are discussed further below.
There is also data available using licenses not compatible with the Panton Principles, but where
the user is allowed to modify and redistribute the data. A new data set in this category is the
data from the ChEMBL database [69], which is available under the Creative Commons Share-Alike
Attribution license. The RSC ChemSpider database [41], although not fully Open, also hosts Open



Data; for example, spectral data when deposited can be marked as Open.
Importantly, publishing data as CC0 is becoming easier now that websites are becoming available
to simplify publishing data. Two projects that can be mentioned in this context are FigShare
[70], where the data behind unpublished figures can be hosted, and Dryad [71] where data behind
publications can be hosted. Initiatives like this make it possible to host small amounts of
data, and those combined are expected to soon become a substantial knowledge base.

Reaction Attempts
Although there are existing databases that allow for searching reactions, those using Open Data
are harder to find. The Reaction Attempts database [72], to which anyone can submit reaction
attempts data, consists mainly of reaction information abstracted from Open Notebooks in organic
chemistry, such as the UsefulChem project from the Bradley group [73] and the notebooks from the
Todd group [74]. Key information from each experiment is abstracted manually, with the only
required information consisting of the ChemSpider IDs of the reactants and the product targeted
in the experiment; and a link to the laboratory notebook page. Information in the database can
be searched and accessed using the web-based Reaction Attempts Explorer [75].
Since the database reflects all data from the notebooks, it includes experiments in progress,
ambiguous results and failed runs. Unlike most reaction databases that only identify experiments
successfully reported in the literature, the Reaction Attempts Explorer allows researchers to
easily find patterns in reactions that have already been performed, and since the data are open
and results are reported across all research groups, intersections are easily discovered and
possible Open Collaboration opportunities are easily found [76,77].

Non-Aqueous Solubility
Although the aqueous solubility of many common organic compounds is generally available,
quantitative reports of non-aqueous solubility are more difficult to find. Such information can
be valuable for selecting solvents for reactions, re-crystallization and related processes. In
2008, the Open Notebook Science Solubility Challenge was launched for the purpose of measuring
non-aqueous solubility of organic compounds, reporting all the details of the experiments in an
Open Notebook and recording the results as Open Data in a centralized database [78,79]. This
crowdsourcing project was also supported by Submeta, Sigma-Aldrich, Nature Publishing Group and
the Royal Society of Chemistry. The database currently holds 1932 total measurements and 1428
averaged solute/solvent measurements all of which
are available under a CC0 license. Several web services and feeds are available to filter and
re-use the dataset [80]. In particular, models have been developed for the prediction of
non-aqueous solubility in 72 different solvents [81] using the method of Abraham et al. [82]
with descriptors calculated by the Chemistry Development Kit. These models are available online
and will be refined as more solubility data is collected.

The Blue Obelisk Data Repository (BODR)
The Blue Obelisk has created a repository of key chemical data in a machine-readable format
[83]. The BODR focuses on data that is commonly required for chemistry software, and where there
is a need to ensure that values are standard between codes. Examples are atomic masses and
conversions between physical constants. These data can be used by others for any purpose (for
example, for entry into Wikipedia or use in in-house software), and should lead to an
enhancement in the quality of community reference data. The Blue Obelisk also provides a
complementary project, the Chemical Structure Repository [83]. It aims to provide 3D
coordinates, InChIs and several physico-chemical descriptors for a set of 570 organic compounds.

NMRShiftDB
NMRShiftDB [84,85] represents one of the earliest resources for Open community-contributed data
(first released in 2003). Research groups that measure NMR spectra or extract them from the
literature can contribute that information to NMRShiftDB, which provides an Open resource where
entries can be searched by chemical structure or properties (especially peaks). Although it is
difficult to encourage large amounts of altruistic contribution (as happens with Wikipedia), an
alternative possible source of data could come from linking data capture with data publication.
For example, the Blue Obelisk has enough software that it is possible to create a seamless chain
for converting NMR structures in-house into NMRShiftDB entries. If and when the chemistry
community encourages or requires semantic publication of spectra rather than PDFs, it would be
possible to populate NMRShiftDB rapidly along the lines of CrystalEye (see below). A similar
approach has been demonstrated earlier using the Blue Obelisk components Oscar and Bioclipse
using text mining approaches [86].

CrystalEye
CrystalEye [87] is an example of cost-effective extraction of data from the literature where
this is published both Openly and semantically. Software extracts Openly-published crystal
structures from a variety of scholarly journals, processes them and then makes them available
through a web interface. It currently contains



about 250,000 structures. CrystalEye serves as a model for a high-value, high-quality Open data
resource, including the licensing of each component as Panton-compatible Open data.

Other areas of activity
While each Blue Obelisk project has its own website and point of contact (typically a mailing
list), because of the breadth of Blue Obelisk projects it can be difficult for a newcomer to
understand which of them, if any, can best address a particular problem. To address this issue,
members of the Blue Obelisk established a Question & Answer website [88] (see Figure 5). This is
a website in the style of Stack Overflow [89] that encourages high quality answers (and
questions) through the use of a voting system. In the year since it was established, over 200
users have registered, many of whom had no previous involvement with the Blue Obelisk, showing
that the Q&A website complements earlier existing channels of communication.
The rise of self-publishing and print-on-demand services has meant that publishing a book is now
as straightforward as uploading to an appropriate website. Unlike the traditional publishing
route where books with projected low sales volume would be expensive, websites such as Lulu [90]
allow the sale of low-priced books on chemistry software, and books are now available for
purchase on Jmol [91], the Chemistry Development Kit [92] and Open Babel [93].

Conclusions
We have shown that the Blue Obelisk has been very successful in bringing together researchers
and developers with common interests in ODOSOS, leading to development of many useful resources
freely available to the chemistry community. However, how best to engage with the wider
chemistry community outside of the Blue Obelisk remains an open question. If the Blue Obelisk is
truly to make an impact, then an attempt must be made to reach beyond the subscribers to the
Blue Obelisk mailing list and blogs of members.
We hope to see this involvement between the Blue Obelisk and the wider community grow in the
future. To this end, we encourage the reader to visit the Blue Obelisk website [94], send a
message to our mailing list, investigate related projects or read our blogs.

Acknowledgements
NMOB is supported by a Health Research Board Career Development Fellowship (PD/2009/13). The
OSRA project has been funded in whole or in part with federal funds from the National Cancer
Institute, National Institutes of Health, under contract HHSN261200800001E. The content of this
publication does not necessarily reflect the views or policies of the Department of Health and
Human Services, nor does mention of trade names, commercial products, or organisations imply
endorsement by the US Government.

Author details
1Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University
College Cork, College Road, Cork, Co. Cork, Ireland. 2NIH Center for Translational Therapeutics,
9800 Medical Center Drive, Rockville, MD 20878, USA. 3Division of Molecular Toxicology,
Institute of Environmental Medicine, Nobels väg 13, Karolinska Institutet, 171 77 Stockholm,
Sweden. 4Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University
of Cambridge, Lensfield Road, CB2 1EW, UK. 5Department of Pharmaceutical Biosciences, Uppsala
University, Box 591, 751 24 Uppsala, Sweden. 6Department of Chemistry, Drexel University, 32nd
and Chestnut streets, Philadelphia, PA 19104, USA. 7Chemical Biology Laboratory, Basic Research
Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD 21702, USA. 8St. Olaf College, 1520
St. Olaf Ave., Northfield, MN 55057, USA. 9Kitware, Inc., 28 Corporate Drive, Clifton Park, NY
12065, USA. 10Department of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh,
PA 15260, USA. 11eMolecules Inc., 380 Stevens Ave., Solana Beach, California 92075, USA.
12Ideaconsult Ltd., 4.A.Kanchev str., Sofia 1000, Bulgaria. 13Department of Engineering,
Computer Science, Physics, and Mathematics, Oral Roberts University, 7777 S. Lewis Ave. Tulsa,
OK 74171, USA. 14Leiden Institute of Chemistry, Leiden University, Einsteinweg 55, 2333 CC
Leiden, The Netherlands. 15Department of Chemistry, State University of New York at Buffalo,
Buffalo, NY 14260-3000, USA. 16Université de Strasbourg, IPHC, CNRS, UMR7178, 23 rue du Loess
67037, Strasbourg, France. 17GGA Software
Services LLC, 41 Nab. Chernoi rechki 194342, Saint Petersburg, Russia.
18
Cheminformatics and Metabolism Team, European Bioinformatics Institute
(EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
19
Department of Chemistry, University of Washington, Seattle, WA 98195,
USA. 20iChemLabs, 200 Centennial Ave., Suite 200, Piscataway, NJ 08854,
USA.

Authors’ contributions
The overall layout of the manuscript grew from discussions between NMOB,
RG and ELW. The authorship of the paper is drawn from those people
connected with fully Open Data/Standards/Source (OSI-compliant or OKF-
compliant) projects associated with the Blue Obelisk. There are a large
number of people contributing to these projects and because those projects
are published in their own right it is not appropriate to include all their
developers by default. We invited a number of ‘project gurus’ who have
been active in promoting the Blue Obelisk to be authors on this paper, and
most have accepted and contributed.

Competing interests
The authors declare that they have no competing interests.
Figure 5 Screenshot of the Blue Obelisk eXchange Question and Answer website.

Received: 1 July 2011 Accepted: 14 October 2011 Published: 14 October 2011

J. Cheminf. 2011, 3, 37.


References
1. Matos PD, Alcantara R, Dekker A, Ennis M, Hastings J, Haug K, Spiteri I, Turner S, Steinbeck C: Chemical Entities of Biological Interest: an update. Nucleic Acids Res 2009, 38:D249-D254.
2. Murray-Rust P: The Blue Obelisk. CDK News 2005, 2:43-46.
3. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The Blue Obelisk - Interoperability in Chemical Informatics. J Chem Inf Model 2006, 46:991-998.
4. Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.
5. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43:493-500.
6. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Design 2006, 12:2111-2120.
7. Open Babel. [http://openbabel.org].
8. JOELib. [http://sf.net/projects/joelib].
9. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM: Small Molecule Subgraph Detector (SMSD) Toolkit. J Cheminf 2009, 1:12.
10. Dong X, Gilbert K, Guha R, Heiland R, Kim J, Pierce M, Fox G, Wild D: A Web Service Infrastructure for Chemoinformatics. J Chem Inf Model 2007, 47:1303-1307.
11. RDKit. [http://rdkit.org].
12. Indigo. [http://ggasoftware.com/opensource/indigo].
13. Open Babel - Related Software. [http://openbabel.org/wiki/Related_Projects].
14. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
15. Avogadro: an open-source molecular builder and visualization tool. [http://avogadro.openmolecules.net].
16. Spjuth O, Alvarsson J, Berg A, Eklund M, Kuhn S, Masak C, Torrance G, Wagener J, Willighagen E, Steinbeck C, Wikberg J: Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397.
17. Alvarsson J, Andersson C, Spjuth O, Larsson R, Wikberg J: Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 2011, 12:179.
18. Kalzium - Periodic Table and Chemistry in KDE. [http://edu.kde.org/applications/science/kalzium/].
19. XtalOpt - Evolutionary Crystal Structure Prediction. [http://xtalopt.openmolecules.net].
20. Lonie DC, Zurek E: XtalOpt: An open-source evolutionary algorithm for crystal structure prediction. Comput Phys Commun 2011, 182:372-387.
21. Jeliazkova N, Jeliazkov V: AMBIT RESTful web services: an implementation of the OpenTox application programming interface. J Cheminf 2011, 3:18.
22. Jeliazkova N, Jaworska J, Worth A: Open Source Tools for Read-Across and Category Formation. In In Silico Toxicology: Principles and Applications. Edited by: Cronin M, Madden J. Cambridge UK: RSC Publishing; 2010:408-445.
23. ToxTree. [http://toxtree.sf.net].
24. Patlewicz G, Jeliazkova N, Safford RJ, Worth AP, Aleksiev B: An evaluation of the implementation of the Cramer classification scheme in the Toxtree software. SAR QSAR Environ Res 2008, 19:495-524.
25. O’Boyle NM, Tenderholt AL, Langner KM: cclib: A library for package-independent computational chemistry algorithms. J Comp Chem 2008, 29:839-845.
26. GaussSum. [http://gausssum.sf.net].
27. QMForge. [http://qmforge.sf.net].
28. Dapprich S, Frenking G: Investigation of Donor-Acceptor Interactions: A Charge Decomposition Analysis Using Fragment Molecular Orbitals. J Phys Chem 1995, 99:9352-9362.
29. JUMBO-Converters. [https://bitbucket.org/wwmm/jumbo-converters].
30. Lensfield 2. [https://bitbucket.org/sea36/lensfield2].
31. Chempound. [https://bitbucket.org/chempound/chempound].
32. Stein W, et al: Sage Mathematics Software. The Sage Development Team; 2011 [http://www.sagemath.org].
33. Hanson RM: Jmol - a paradigm shift in crystallographic visualization. J Appl Cryst 2010, 43:1250-1260.
34. ChemDoodle Web Components: HTML5 Chemistry. [http://web.chemdoodle.com].
35. Stahl MT: Open-source software: not quite endsville. Drug Discov Today 2005, 10:219-22.
36. DeLano WL: The case for open-source software in drug discovery. Drug Discov Today 2005, 10:213-7.
37. Munos B: Can open-source R&D reinvigorate drug research? Nat Rev Drug Discov 2006, 5:723-9.
38. Geldenhuys WJ, Gaasch KE, Watson M, Allen DD, Van der Schyf CJ: Optimizing the use of open-source software applications in drug discovery. Drug Discov Today 2006, 11:127-32.
39. iBabel. [http://homepage.mac.com/swain/Sites/Macinchem/page65/ibabel3.html].
40. ChemSpotlight. [http://chemspotlight.openmolecules.net].
41. ChemSpider - the free chemical database. [http://www.chemspider.com].
42. iChemLabs and RSC ChemSpider announce partnership. [http://www.chemspider.com/blog/ichemlabs-and-rsc-chemspider-announce-partnership.html].
43. Silicos Open Source Software. [http://silicos.silicos-it.com/download.html].
44. OSRA. [http://cactus.nci.nih.gov/osra/].
45. Lowe DM, Corbett PT, Murray-Rust P, Glen RC: Chemical Name to Structure: OPSIN, an Open Source Solution. J Chem Inf Model 2011, 51:739-753.
46. IUPAC: Nomenclature of Organic Chemistry. Pergamon Press, Oxford; 1979.
47. IUPAC: A Guide to IUPAC Nomenclature of Organic Compounds (Recommendations 1993). Blackwell Scientific publications, Oxford; 1993.
48. Rijnbeek M, Steinbeck C: OrChem - An open source chemistry search engine for Oracle(R). J Cheminf 2009, 1:17.
49. Willighagen EL, Brändle MP: Resource description framework technologies in chemistry. J Cheminf 2011, 3:15.
50. Chen B, Dong X, Jiao D, Wang H, Zhu Q, Ding Y, Wild DJ: Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11:255.
51. Rappé AK, Casewit CJ, Colwell KS, Goddard WA III, Skiff WM: UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 1992, 114:10024-10035.
52. PgChem. [http://pgfoundry.org/projects/pgchem/].
53. Mychem. [http://mychem.sf.net].
54. Bingo. [http://ggasoftware.com/opensource/bingo].
55. MolCore. [http://molcore.sf.net].
56. O’Boyle NM, Hutchison GR: Cinfony - combining Open Source cheminformatics toolkits behind a common interface. Chem Cent J 2008, 2:24.
57. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006, 34:W729-W732.
58. KNIME. [http://www.knime.org].
59. Kuhn T, Willighagen E, Zielesny A, Steinbeck C: CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinformatics 2010, 11:159.
60. JNI-InChI. [http://jni-inchi.sf.net/index.html].
61. The OpenSMILES specification. [http://opensmiles.org].
62. Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE: Towards interoperable and reproducible QSAR analyses: Exchange of datasets. J Cheminf 2010, 2:5.
63. The Blue Obelisk Descriptor Ontology. [http://qsar.sourceforge.net/dicts/qsar-descriptors/index.xhtml].
64. Wagener J, Spjuth O, Willighagen EL, Wikberg JES: XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services. BMC Bioinformatics 2009, 10:279.
65. PubChem Standardization Service. [http://pubchem.ncbi.nlm.nih.gov//standardize/standardize.cgi].
66. Panton Principles - Principles for Open Data in Science. [http://pantonprinciples.org].
67. About CC0 - “No Rights Reserved”. [http://creativecommons.org/about/cc0].
68. Open Licenses - Data. [http://www.opendefinition.org/licenses/#Data].
69. Overington J: ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). Interview by Wendy A. Warr. J Comp Aided Mol Des 2009, 23:195-198.

70. FigShare. [http://figshare.com].
71. Dryad. [http://datadryad.org].
72. Reaction Attempts Database. [http://onswebservices.wikispaces.com/reactions].
73. Bradley JC: Useful Chemistry: Reaction Attempts Book Edition 1 and UsefulChem Archive. [http://usefulchem.blogspot.com/2010/04/reaction-attempts-book-edition-1-and.html].
74. Bradley JC: Useful Chemistry: The Synaptic Leap Experiments on Reaction Attempts. [http://usefulchem.blogspot.com/2010/05/synaptic-leap-experiments-on-reaction.html].
75. Bradley JC: Useful Chemistry: Reaction Attempts Explorer. [http://usefulchem.blogspot.com/2010/06/reaction-attempts-explorer.html].
76. Bradley JC: Useful Chemistry: Visualizing Social Networks in Open Notebooks. [http://usefulchem.blogspot.com/2010/12/visualizing-social-networks-in-open.html].
77. Bradley JC, Lang AS, Koch S, Neylon C: Collaboration using Open Notebook Science in Academia. In Collaborative computational technologies for biomedical research. Edited by: Ekins S, Hupcey MA, Williams AJ. Hoboken N.J.: John Wiley 2011:425-452.
78. Bradley JC, Neylon C, Guha R, Williams AJ, Hooker B, Lang ASID, Friesen B, Bohinski T, Bulger D, Federici M, Hale J, Mancinelli J, Mirza KB, Moritz MJ, Rein D, Tchakounte C, Truong HT: Open Notebook Science Challenge: Solubilities of Organic Compounds in Organic Solvents. Nature Precedings 2010 [http://dx.doi.org/10.1038/npre.2010.4243.3].
79. Bradley J, Guha R, Lang A, Lindenbaum P, Neylon C, Williams A, Willighagen E: Beautifying Data in the Real World. In Beautiful Data. 1 edition. Edited by: Segaran T, Hammerbacher J. Sebastopol CA: O’Reilly; 2009:259-278.
80. Open Notebook Solubility Web Services. [http://onswebservices.wikispaces.com/solubility].
81. Bradley J: Useful Chemistry: General Transparent Solubility Prediction using Abraham Descriptors. [http://usefulchem.blogspot.com/2010/07/general-transparent-solubility.html].
82. Abraham MH, Smith RE, Luchtefeld R, Boorem AJ, Luo R, Acree WE Jr: Prediction of solubility of drugs and other compounds in organic solvents. J Pharm Sci 2010, 99:1500-1515.
83. Blue Obelisk Data Repository. [http://bodr.sf.net].
84. NMRShiftDB. [http://www.nmrshiftdb.org].
85. Steinbeck C, Kuhn S: NMRShiftDB - compound identification and structure elucidation support through a free community-built web database. Phytochemistry 2004, 65:2711-2717.
86. Willighagen EL: Chemical Archeology: OSCAR3 to NMRShiftDB.org. [http://chem-bla-ics.blogspot.com/2006/09/chemical-archeology-oscar3-to.html].
87. CrystalEye. [http://wwmm.ch.cam.ac.uk/crystaleye/].
88. Blue Obelisk QA. [http://blueobelisk.shapado.com].
89. Stack Overflow. [http://stackoverflow.com].
90. Lulu. [http://lulu.com].
91. Herráez A: In How to use Jmol to study and present molecular structures. Volume 1. Lulu Enterprises, Morrisville, NC, US; 2007.
92. Willighagen E: Groovy Cheminformatics with the Chemistry Development Kit. Lulu Enterprises, Morrisville, NC, US; 2011.
93. Hutchison GR, Morley C, O’Boyle NM, James C, Swain C, De Winter H, Vandermeersch T: Open Babel - Official User Guide. Lulu Enterprises, Morrisville, NC, US; 2011.
94. Blue Obelisk web site. [http://blueobelisk.org].
95. Day N, Murray-Rust P, Tyrrell S: CIFXML: a schema and toolkit for managing CIFs in XML. J Appl Cryst 2011, 44:628-634.
96. Hawizy L, Jessop D, Adams N, Murray-Rust P: ChemicalTagger: A tool for Semantic Text-mining in Chemistry. J Cheminf 2011, 3:17.
97. O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR: Confab - Systematic generation of diverse low-energy conformers. J Cheminf 2011, 3:8.
98. Chempedia. [http://chempedia.com].

doi:10.1186/1758-2946-3-37
Cite this article as: O’Boyle et al.: Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of Cheminformatics 2011 3:37.
