Open Access Publications of
      Noel O’Boyle



       November 2, 2011
My Open Access papers
Contents

I Cheminformatics toolkits
1 Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit
2 Cinfony - combining Open Source cheminformatics toolkits behind a common interface
3 Open Babel: An open chemical toolbox

II Enzyme reaction mechanisms
4 MACiE: a database of enzyme reaction mechanisms
5 MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

III QSAR
6 PYCHEM: a multivariate analysis package for python
7 Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

IV The Rest
8 Userscripts for the life sciences
9 Confab - Systematic generation of diverse low-energy conformers
10 Review of “Data Analysis with Open Source Tools”
11 Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on




Part I

Cheminformatics toolkits




Chemistry Central Journal
 Software                                                                                                                               Open Access
 Pybel: a Python wrapper for the OpenBabel cheminformatics
 toolkit
 Noel M O'Boyle*1,2, Chris Morley3 and Geoffrey R Hutchison4

 Address: 1Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2
 1EW, UK, 2Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK, 3OpenBabel Development Team and 4Department
 of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA
 Email: Noel M O'Boyle* - baoilleach@gmail.com; Chris Morley - c.morley@gaseq.co.uk; Geoffrey R Hutchison - geoffh@pitt.edu
 * Corresponding author




 Published: 9 March 2008                                                             Received: 23 January 2008
                                                                                     Accepted: 9 March 2008
 Chemistry Central Journal 2008, 2:5   doi:10.1186/1752-153X-2-5
 This article is available from: http://journal.chemistrycentral.com/content/2/1/5
 © 2008 O'Boyle et al
 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
 which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.




                   Abstract
                   Background: Scripting languages such as Python are ideally suited to common programming tasks
                   in cheminformatics such as data analysis and parsing information from files. However, for reasons
                   of efficiency, cheminformatics toolkits such as the OpenBabel toolkit are often implemented in
                   compiled languages such as C++. We describe Pybel, a Python module that provides access to the
                   OpenBabel toolkit.
                   Results: Pybel wraps the direct toolkit bindings to simplify common tasks such as reading and
                   writing molecular files and calculating fingerprints. Extensive use is made of Python iterators to
                   simplify loops such as that over all the molecules in a file. A Pybel Molecule can be easily
                   interconverted to an OpenBabel OBMol to access those methods or attributes not wrapped by
                   Pybel.
                   Conclusion: Pybel allows cheminformaticians to rapidly develop Python scripts that manipulate
                   chemical information. It is open source, available cross-platform, and offers the power of the
                   OpenBabel toolkit to Python programmers.




Background
Cheminformaticians often need to write once-off scripts to extract data from text files,
prepare data for analysis or carry out simple statistics. Scripting languages such as Perl,
Python and Ruby are ideally suited to these day-to-day tasks [1]. Such languages are, however,
an order of magnitude or more slower than compiled languages such as C++. Since
cheminformaticians regularly deal with molecular files containing thousands of molecules and
many cheminformatics algorithms are computationally expensive, cheminformatics toolkits are
typically written in compiled languages for performance.

OpenBabel is a C++ toolkit with extensive capabilities for reading and writing molecular file
formats (over 80 are supported) as well as for manipulating molecular data [2]. Many standard
chemistry algorithms are included, for example, determination of the smallest set of smallest
rings, bond order perception, addition of hydrogens, and assignment of Gasteiger charges. In
relation to cheminformatics, OpenBabel supports SMARTS searching [3], molecular fingerprints [4]
(both Daylight-type, and structural-key based), and includes group contribution descriptors for
LogP [5], polar surface area (PSA) [6] and molar refractivity (MR) [5].






Of the current popular scripting languages, Python [7] is the de-facto standard language for
scripting in cheminformatics. Several commercial cheminformatics toolkits have interfaces in
Python: OpenEye's closed-source successor to OpenBabel, OEChem [8], is a C++ toolkit with
interfaces in Python and Java; Rational Discovery's RDKit [9], which is now open source, is a
C++ cheminformatics toolkit with a Python interface; the Daylight toolkit [10] from Daylight
Chemical Information Systems, written in C, only has Java and C++ wrappers but PyDaylight [11],
available separately from Dalke Scientific, provides a Python interface to the toolkit; the
Cambios Molecular Toolkit [12] from Cambios Consulting is a commercial C++ toolkit with a Python
interface. There are also toolkits entirely implemented in Python: Frowns [13], an open source
cheminformatics toolkit by Brian Kelley, and PyBabel [14], an open source toolkit included in
the MGLTools package from the Molecular Graphics Labs at the Scripps Research Institute. Note
that the latter is not related to the OpenBabel project; rather its name derives from the fact
that its aim was to implement in Python some of the functionality of Babel v1.6 [15], a
command-line application for converting file formats which is a predecessor of OpenBabel.

Here we describe the implementation and application of Pybel, a Python module that provides
access to the OpenBabel C++ library from the Python programming language. Pybel builds on the
basic Python bindings to make it easier to carry out frequent tasks in cheminformatics. It also
aims to be as 'Pythonic' as possible; that is, to adhere to Python language conventions and
idioms, and where possible to make use of Python language features such as iterators. The result
is a module that takes advantage of Python's expressive syntax to allow cheminformaticians to
carry out tasks such as SMARTS matching, data field manipulation and calculation of molecular
fingerprints in just a few lines of code.

Implementation
SWIG bindings
Python bindings to the OpenBabel toolkit were created using SWIG [16]. SWIG (Simplified Wrapper
and Interface Generator) is a tool that automates the generation of bindings to libraries
written in C or C++. One of the advantages of SWIG compared to other automated wrapping methods
such as Boost.Python [17] or SIP [18] is that SWIG also supports the generation of bindings to
several other languages. For example, OpenBabel also uses SWIG to generate bindings for Perl,
Ruby and Java. An additional advantage is that SWIG will directly parse C or C++ header files
while Boost.Python and SIP require each C++ class to be exposed manually. The input to SWIG is
an interface file containing a list of OpenBabel header files for which to generate bindings.
Using the signatures in the header files, SWIG generates a C file which, when compiled and
linked with the Python development libraries and OpenBabel, creates a Python extension module,
openbabel. This can then be imported into a Python script like any other Python module using the
"import openbabel" statement.

For a small number of C++ objects and functions, it was necessary to add some convenience
functions to facilitate access from Python. Certain types of molecule files have additional data
present in addition to the connection table. OpenBabel stores these data in subclasses of
OBGenericData such as OBPairData (for the data fields in molecule files such as MOL files and
SDF files) and OBUnitCell (for the data fields in CIF files). To access the data it is necessary
to 'downcast' an instance of OBGenericData to the specific subclass. For this reason, two
convenience functions were added to the interface file, one to cast OBGenericData to OBPairData,
and one to cast to OBUnitCell. Another convenience function was added to convert a Python list
to a C array of doubles, as this type of input is required for a small number of OpenBabel
functions.

Iterators are an important feature of the OpenBabel C++ library. For example, OBAtomAtomIter
allows the user to easily iterate over the atoms attached to a particular atom, and
OBResidueIter is an iterator over the residues in a molecule. The OpenBabel iterators use the
dereference operator to access the data, the increment operator to iterate to the next element,
and the boolean operator to test whether any elements remain. Iterators are also a core feature
of the Python language. However, the iterators used by OpenBabel are not automatically converted
into Python iterators. To deal with this, Python iterator classes that wrap the dereference,
increment and boolean operators behind the scenes were added to the SWIG interface file, so that
Python statements such as "for attached_obatom in OBAtomAtomIter(obatom)" work without problem.

Pybel module
The SWIG bindings provide direct access from Python to the C++ objects and functions in the
OpenBabel API (application programming interface). The purpose of the Pybel module is to wrap
these bindings to present a more Pythonic interface to OpenBabel (Figure 1). This extra level of
abstraction is useful as Python programmers expect Python libraries to behave in certain ways
that a C++ library does not. For example, in Python, attributes of an object are often directly
accessed whereas in C++ it is typical to call Get/Set functions to access them. A C++ function
returning a particular object might require a pointer to an empty object as a parameter, whereas
the Python equivalent would not. Even something as simple





as differences in the conventions for the case of letters used in variable and method names is a
problem, as it makes it more likely for Python programmers to introduce bugs in their code.

Figure 1
The relationship between Python modules described in the text and the OpenBabel C++ library.
Python modules are shown in green; the C++ library is shown in blue.

One of the key aims of Pybel was to reduce the amount of code necessary to carry out common
tasks. This is especially important for a scripting language where programming is often done
interactively at a command prompt. In addition, as for any programming language, repeated entry
of code for routine and common tasks (so-called 'boilerplate code') is a common cause of errors
in code. Reading and writing molecule files is one of the most common tasks for users of
OpenBabel but requires several lines of code if using the SWIG bindings. The following code
shows how to store each molecule in a multimolecule SDF file in a list called allmols:

    import openbabel

    allmols = []
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("sdf")
    obmol = openbabel.OBMol()
    notatend = obconversion.ReadFile(obmol, "inputfile.sdf")
    while notatend:
        allmols.append(obmol)
        obmol = openbabel.OBMol()
        notatend = obconversion.Read(obmol)

To replace this somewhat verbose code, Pybel provides a readfile method that takes a file format
and filename and returns molecules using the 'yield' keyword. This changes the method into a
'generator', a Python language feature where a method behaves like an iterator. Iterators are a
major feature of the Python language which are used for looping over collections of objects. In
Pybel, we have used iterators where possible to simplify access to the toolkit. As a result, the
equivalent to the preceding code is:

    import pybel

    allmols = [mol for mol in pybel.readfile("sdf", "inputfile.sdf")]

The benefits of iterator syntax are clear when dealing with multimolecule files. For single
molecule files, however, the user needs to remember to explicitly request the iterator to return
the first and only molecule using the next method:

    mol = pybel.readfile("mol", "inputfile.mol").next()

Pybel provides replacements for two of the main classes in the OpenBabel library, OBMol and
OBAtom. The following discussion describes the Pybel Molecule class which wraps an instance of
OBMol, but the same design principles apply to the Pybel Atom class. Table 1 summarises the
attributes and methods of the Molecule object. By wrapping the base class, Pybel can enhance the
Molecule





Table 1: Attributes and methods supported by the Pybel Molecule object

 Attribute              Description*

 OBMol                  The underlying OBMol object
 atoms                  A list of Pybel Atoms
 charge                 The total charge (GetTotalCharge)
 data                   A MoleculeData object for access to data fields
 dim                    The dimensionality of the coordinates (GetDimension)
 energy                 The heat of formation (GetEnergy)
 exactmass              The mass calculated using isotopic abundance (GetExactMass)
 flags                  The set of flags used internally by OpenBabel (GetFlags)
 formula                The stoichiometric formula (GetFormula)
 mod                    The number of nested BeginModify() calls (Internal use) (GetMod)
 molwt                  The standard molar mass (GetMolWt)
 spin                   The total spin multiplicity (GetTotalSpinMultiplicity)
 sssr                   The smallest set of smallest rings (GetSSSR)
 title                  The title of the molecule (often the filename) (GetTitle)
 unitcell               Unit cell data (if present)

 Method
 write                  Write the molecule to a file or return it as a string
 calcfp                 Return a molecular fingerprint as a Fingerprint object
 calcdesc               Return the values of the group contribution descriptors
 __iter__               Enable iteration over the Atoms in the Molecule

 *Where a Molecule attribute is a direct replacement for a 'Get' method of the underlying OBMol, the name of the method is given in parentheses.
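
The correspondence in Table 1 between Molecule attributes and OBMol 'Get' methods can be illustrated with Python properties. The following is only a minimal sketch under stated assumptions, not Pybel's actual source: FakeOBMol is a hypothetical stand-in for an OBMol so that the example runs without OpenBabel installed, and just three attributes and the __iter__ method are shown. Each property calls the underlying Get method every time it is read, so nothing is cached.

```python
class FakeOBMol:
    """Hypothetical stand-in for an OBMol (illustration only, not OpenBabel)."""

    def __init__(self, title, formula, molwt, atoms):
        self._title = title
        self._formula = formula
        self._molwt = molwt
        self._atoms = list(atoms)

    # C++-style Get/Set accessors, in the style exposed by the SWIG bindings
    def GetTitle(self):
        return self._title

    def SetTitle(self, title):
        self._title = title

    def GetFormula(self):
        return self._formula

    def GetMolWt(self):
        return self._molwt


class Molecule:
    """Sketch of a wrapper that exposes Get methods as plain attributes."""

    def __init__(self, obmol):
        self.OBMol = obmol  # the wrapped object remains directly accessible

    @property
    def title(self):
        return self.OBMol.GetTitle()

    @property
    def formula(self):
        return self.OBMol.GetFormula()

    @property
    def molwt(self):
        return self.OBMol.GetMolWt()

    @property
    def atoms(self):
        # A fresh list on every access; here atoms are just element symbols
        return list(self.OBMol._atoms)

    def __iter__(self):
        # Enables "for atom in mol"
        return iter(self.atoms)


obmol = FakeOBMol("ethanol", "C2H6O", 46.07, ["C", "C", "O"])
mol = Molecule(obmol)
print(mol.formula, mol.molwt)

# Changing the wrapped object is visible through the wrapper,
# because each attribute is looked up at access time.
obmol.SetTitle("renamed")
print(mol.title)
print([atom for atom in mol])
```

Keeping the wrapped object in a public OBMol attribute means that anything not covered by the wrapper is still reachable, at the cost of exposing two interfaces to the same data.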


object by providing (1) direct access to attributes rather than through the use of Get methods,
(2) additional attributes of the object, and (3) additional methods that act on the object.

(1) As mentioned earlier, it is typical in Python to access attribute values directly rather
than using Get/Set methods. With this in mind, the Molecule class adds attributes such as
energy, formula and molwt (among others) which give the values returned by calling GetEnergy(),
GetFormula() and GetMolWt(), respectively, on the underlying OBMol (see Table 1 for the full
list).

(2) One of the aims of Pybel is to simplify access to some of the most common attributes. With
this in mind, an atoms attribute has been added which returns a list of the atoms of the
molecule as Pybel Atoms. Access to the data fields associated with a molecule has been
simplified by creation of a MoleculeData object which is returned when the data attribute of a
Molecule is accessed. MoleculeData presents a dictionary interface to the data fields of the
molecule. Accessing and updating these fields is more convoluted if using the SWIG bindings.
Compare the following statements for accessing the "comment" field of the variable mol, an
OBMol:

    # Using the SWIG bindings
    value = openbabel.toPairData(mol.GetData("comment")).GetValue()

    # Using Pybel
    value = pybel.Molecule(mol).data["comment"]

It should be noted that all of these attributes are calculated on-the-fly rather than stored for
future access as the underlying OBMol may have been modified.

(3) Four additional methods have been added to the Pybel Molecule (Table 1). The first is a
write method which writes a representation of the Molecule to a file and takes care of error
handling. As with reading molecules from files (see above), this method simplifies the procedure
significantly compared to using the SWIG bindings directly. In addition, a calcfp method and a
calcdesc method have been added which calculate a binary fingerprint for the molecule, and some
descriptor values, respectively. In the OpenBabel library these are not methods of the OBMol,
but rather are loaded as plugins (by OBFingerprint.FindFingerprint and OBDescriptor.FindType,
respectively) to which an OBMol is passed as input. The __iter__ method is a special Python
method that enables iteration over an object; in the case of a Molecule, the defined iterator
loops over the Atoms of the Molecule. This feature enables constructions such as "for atom in
mol" where mol is a Pybel Molecule.

SMARTS is a query language developed by Daylight Chemical Information Systems for molecular
substructure





searching [3]. As implemented in the OpenBabel toolkit, finding matches of a particular
substructure in a particular molecule is a four step process that involves creating an instance
of OBSmartsPattern, initialising it with a SMARTS pattern, searching for a match, and finally
retrieving the result:

    obsmarts = openbabel.OBSmartsPattern()
    obsmarts.Init("[#6][#6]")
    obsmarts.Match(obmol)
    results = obsmarts.GetUMapList()

Since a SMARTS query can be thought of as a regular expression for molecules, in Pybel we decided
to wrap the SMARTS functionality in an analogous way to Python's regular expression module, re.
With these changes, the same process takes only two steps, an initialisation step and a search
step:

    smarts = pybel.Smarts("[#6][#6]")
    results = smarts.findall(pybelmol)

Pybel was not written to replace the SWIG bindings but rather to make it simpler to perform
common tasks. As a result, Pybel does not attempt to wrap every single method and class in the
OpenBabel library. Because of this, a user may often want to interconvert between an OBMol and a
Molecule, or an OBAtom and an Atom. This is quite a straightforward process. A Pybel Molecule
can be created by passing an OBMol to the Molecule constructor. In the following example an
OBMol is created using the SWIG bindings and then written to a file using Pybel:

    obmol = openbabel.OBMol()
    a = obmol.NewAtom()
    a.SetAtomicNum(6)
    a.SetVector(0.0, 1.0, 2.0) # Set coordinates
    b = obmol.NewAtom()
    obmol.AddBond(1, 2, 1) # Single bond from Atom 1 to Atom 2
    pybel.Molecule(obmol).write("mol", "outputfile.mol")

The OBMol wrapped by a Pybel Molecule can be accessed through the OBMol attribute. This makes it
easy to call a method not wrapped by Pybel, such as OBMol.NumRotors, which returns the number of
rotatable bonds in a molecule:

    mol = pybel.readfile("mol", "inputfile.mol").next()
    numrotors = mol.OBMol.NumRotors()

Documentation and Testing
To minimise programming errors, programs written in dynamically-typed languages such as Python
should be tested comprehensively. Pybel has 100% code coverage in terms of unit tests, as
measured by Ned Batchelder's coverage.py [19]. It also has several doctests, short snippets of
Python code included in documentation strings which serve as both examples of usage and as unit
tests.

The Pybel API is fully documented with docstrings. These can be accessed in the usual way with
the help() command at the interactive Python prompt after importing Pybel: for example,
"help(pybel.Molecule)". In addition, the OpenBabel Python web page [20] contains a complete
description of how to use the SWIG bindings and the Pybel API. The webpage also contains links
to HTML versions of the OpenBabel API documentation and Pybel API documentation. The latter is
included in Additional File 1.

Results and Discussion
The principal aim of Pybel is to make it simpler to use the OpenBabel toolkit to carry out
common tasks in cheminformatics. These common tasks include reading and writing molecule files,
accessing data fields of a molecule, computing and comparing molecular fingerprints and SMARTS
matching. Here we present some examples that illustrate how Pybel may be used to carry out
common cheminformatics tasks.

Removal of duplicate molecules
When merging different datasets or as a final step in preprocessing, it may be necessary to
identify and remove duplicate molecules. In the following example, only the unique molecules in
the multimolecule SDF file "inputfile.sdf" will be written to "uniquemols.sdf". Here we will
assume that a unique InChI string (IUPAC International Chemical Identifier) indicates a unique
molecule. A similar procedure could be performed using the OpenBabel canonical SMILES format, by
replacing "inchi" with "can" in the following:

    import pybel

    inchis = []





    output = pybel.Outputfile("sdf", "uniquemols.sdf")
    for mol in pybel.readfile("sdf", "inputfile.sdf"):
        inchi = mol.write("inchi")
        if inchi not in inchis:
            output.write(mol)
            inchis.append(inchi)
    output.close()

Selection of similar molecules
Another common task in cheminformatics is the selection of a set of molecules of similar
structure to a target molecule. Here we will assume that structural similarity is indicated by a
Tanimoto coefficient [21] of at least 0.7 with respect to Daylight-type (that is, based on
hashed paths through the molecular graph) fingerprints. Note that Pybel redefines the | operator
(bitwise OR) for Fingerprint objects as the Tanimoto coefficient:

    import pybel

    targetmol = pybel.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = pybel.Outputfile("sdf", "similarmols.sdf")
    for mol in pybel.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Applying a Rule of Fives filter
In an influential paper, Lipinski et al. [22] performed an analysis of drug compounds that
reached Phase II clinical trials and found that they tended to occupy a certain range of values
for molecular weight, LogP, and number of hydrogen bond donors and acceptors. Based on this,
they proposed a rule with four criteria to identify molecules that might have poor absorption or
permeation properties. This is the Lipinski Rule of Fives, so-called as the numbers involved are
all multiples of five. The following example shows how to filter a database to identify only
those molecules that pass all four of the Lipinski criteria. The values of the Lipinski
descriptors are also added to the output file as data fields. Note that whereas molecular weight
is directly available as an attribute of a Molecule, and LogP is available as one of the three
group contribution descriptors calculated by OpenBabel, we need to use SMARTS pattern matching
to identify the number of hydrogen bond donors and acceptors. The SMARTS patterns used here
correspond to the definitions of hydrogen bond donor and acceptor used by Lipinski:

    import pybel

    HBD = pybel.Smarts("[#7,#8;!H0]")
    HBA = pybel.Smarts("[#7,#8]")

    def lipinski(mol):
        """Return the values of the Lipinski descriptors."""
        desc = {'molwt': mol.molwt,
                'HBD': len(HBD.findall(mol)),
                'HBA': len(HBA.findall(mol)),
                'LogP': mol.calcdesc(['LogP'])['LogP']}
        return desc

    passes_all_rules = lambda desc: (desc['molwt'] <= 500 and
                                     desc['HBD'] <= 5 and desc['HBA'] <= 10 and
                                     desc['LogP'] <= 5)

    if __name__=="__main__":
        output = pybel.Outputfile("sdf", "passLipinski.sdf")
        for mol in pybel.readfile("sdf", "inputfile.sdf"):
            descriptors = lipinski(mol)
            if passes_all_rules(descriptors):

            mol.data.update(descriptors)
            output.write(mol)
    output.close()

Future work
The future development of Pybel is closely linked to any changes and improvements to OpenBabel. With each new release of the OpenBabel API, the SWIG bindings will be updated to include any additional functionality. However, additions to the Pybel API will only occur if they simplify access to new features of the OpenBabel toolkit of general use to cheminformaticians. In general, the Pybel API can be considered stable, and an effort will be made to ensure that future changes will be backwards compatible.

Conclusion
Pybel provides a high-level Python interface to the widely-used OpenBabel C++ toolkit. This combination of a high performance cheminformatics toolkit and an expressive scripting language makes it easy for cheminformaticians to rapidly and efficiently write scripts to manipulate molecular data.

Pybel is freely available from the OpenBabel web site [2] both as part of the OpenBabel source distribution and for Windows as an executable installer. Compiled versions are also available as packages in some Linux distributions (openbabel-python in Fedora, for example).

Availability and Requirements
Project name: Pybel
Project home page: http://openbabel.sf.net/wiki/Python
Operating system(s): Platform independent
Programming language: Python
Other requirements: OpenBabel
License: GNU GPL
Any restrictions to use by non-academics: None

Authors' contributions
GRH is the lead developer of OpenBabel and created the SWIG bindings. NMOB developed Pybel, and extended the SWIG interface file. CM compiled the SWIG bindings on Windows and added convenience functions to the OpenBabel API to facilitate access from scripting languages. All authors read and approved the final manuscript.

Additional material
Additional file 1
Pybel API. The HTML documentation of the Pybel API (application programming interface).
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-5-S1.zip]

Acknowledgements
The idea for the Pybel module was inspired by Andrew Dalke's work on PyDaylight [11]. We thank the anonymous reviewers for their helpful comments.

References
1.  Ousterhout JK: Scripting: Higher Level Programming for the 21st Century. [http://home.pacbell.net/ouster/scripting.html]
2.  OpenBabel v.2.1.1 [http://openbabel.sf.net]
3.  SMARTS – A Language for Describing Molecular Patterns [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html]
4.  Flower DR: On the properties of bit string-based measures of chemical similarity. J Chem Inf Comput Sci 1998, 38:379-386.
5.  Wildman SA, Crippen GM: Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 1999, 39:868-873.
6.  Ertl P, Rohde B, Selzer P: Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J Med Chem 2000, 43:3714-3717.
7.  Python [http://www.python.org]
8.  OEChem: OpenEye Scientific Software: Santa Fe, NM.
9.  RDKit [http://www.rdkit.org]
10. Daylight Toolkit: Daylight Chemical Information Systems, Inc.: Aliso Viejo, CA.
11. PyDaylight: Dalke Scientific Software, LLC: Santa Fe, NM.
12. Cambios Molecular Toolkit: Cambios Computing, LLC: Palo Alto, CA.
13. Frowns [http://frowns.sf.net]
14. PyBabel in MGLTools [http://mgltools.scripps.edu]
15. Babel v.1.6 [http://smog.com/chem/babel/]
16. SWIG v.1.3.31 [http://www.swig.org]
17. Boost.Python [http://www.boost.org/libs/python/doc/]
18. SIP – A Tool for Generating Python Bindings for C and C++ Libraries [http://www.riverbankcomputing.co.uk/sip/]
19. coverage.py [http://nedbatchelder.com/code/modules/coverage.html]
20. OpenBabel Python [http://openbabel.sourceforge.net/wiki/Python]
21. Jaccard P: La distribution de la flore dans la zone alpine. Rev Gen Sci Pures Appl 1907, 18:961-967.
22. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Del Rev 1997, 23:3-25.


Chemistry Central Journal
 Software                                                                                                                                 Open Access
 Cinfony – combining Open Source cheminformatics toolkits behind
 a common interface
 Noel M O'Boyle*1 and Geoffrey R Hutchison2

 Address: 1Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK and 2Department of Chemistry, University of
 Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA
 Email: Noel M O'Boyle* - oboyle@ccdc.cam.ac.uk; Geoffrey R Hutchison - geoffh@pitt.edu
 * Corresponding author




 Published: 3 December 2008                                                      Received: 9 October 2008
                                                                                 Accepted: 3 December 2008
 Chemistry Central Journal 2008, 2:24   doi:10.1186/1752-153X-2-24
 This article is available from: http://journal.chemistrycentral.com/content/2/1/24
 © 2008 O'Boyle et al




                   Abstract
                   Background: Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit
                   share the same core functionality but support different sets of file formats and forcefields, and
                   calculate different fingerprints and descriptors. Despite their complementary features, using these
                   toolkits in the same program is difficult as they are implemented in different languages (C++ versus
                   Java), have different underlying chemical models and have different application programming
                   interfaces (APIs).
                   Results: We describe Cinfony, a Python module that presents a common interface to all three of
                   these toolkits, allowing the user to easily combine methods and results from any of the toolkits. In
                   general, the run time of the Cinfony modules is almost as fast as accessing the underlying toolkits
                   directly from C++ or Java, but Cinfony makes it much easier to carry out common tasks in
                   cheminformatics such as reading file formats and calculating descriptors.
                   Conclusion: By providing a simplified interface and improving interoperability, Cinfony makes it
                   easy to combine complementary features of OpenBabel, the CDK and the RDKit.




 Background
 Cheminformatics toolkits are essential to the day-to-day work of the practising cheminformatician. They enable the user to deal with such tasks as handling different chemistry file formats, substructure searching, calculation of molecular fingerprints, and structure diagram generation. The main Open Source cheminformatics libraries under active development are OpenBabel [1], the Chemistry Development Kit (CDK) [2], and the RDKit [3]. OpenBabel is a C++ toolkit with bindings in Perl, Python, Ruby and Java, the CDK is a Java toolkit, while the RDKit is another C++ toolkit with Python bindings. While the CDK has its origins in academia, both OpenBabel and the RDKit originated in companies (OpenEye and Rational Discovery, respectively) and have subsequently been developed by the community under Open Source licenses.

 In general, all of these toolkits share the same core functionality although the implementation details and underlying chemical model may differ. However, as a result of their independent development and history, each has functionality specific to itself and each toolkit supports different sets of file formats and forcefields, and can calculate different molecular fingerprints and molecular descriptors (Table 1). Despite the diversity of these toolkits and the potential benefits in being able to access all of them at the same time, there has been little work on interoperability between them. This has resulted in a balkanization of this field such that users of one toolkit rarely use another toolkit.

 One way to achieve interoperability of chemical toolkits is through the use of standard file formats for exchange of

                                                                                                                                          Page 1 of 10
Chem. Cent. J. 2008, 2, 24.                                                                                         (page number not for citation purposes)
Chemistry Central Journal 2008, 2:24                                                     http://journal.chemistrycentral.com/content/2/1/24



Table 1: Some features of toolkits which are not shared by all three toolkits.

 CDK
 A large number of descriptors (some overlap with RDKit)
 Pharmacophore searching (like RDKit*)
 Calculation of maximum common substructure
 2D structure layout (like RDKit) and depiction
 MACCS keys (also RDKit) and E-State fingerprints
 Integration with the R statistical programming environment
 Support for mass-spectrometry analysis (representations for cleavage reactions, structure generation from formulae)
 Fragmentation schemes (ring fragments, Murcko)
 3D structure generation using a template and heuristics (like OpenBabel)
 3D similarity using ultrafast shape descriptors
 Gasteiger π charge calculation

 OpenBabel
 Not just focused on cheminformatics
 Supports a very large number of chemical file formats including quantum mechanics file formats, molecular mechanics trajectories, 2D sketchers
 3D structure generation using a template method (like CDK)
 Included in all major Linux distributions
 Bindings available from several scripting languages apart from Python, as well as the Java and .NET platforms
 Conformation generation and searching
 InChI (also CDK) and InChIKey generation
 Support for crystallographic space groups
 Several forcefield implementations: UFF (also RDKit), MMFF94, MMFF94s, Ghemical
 Ability to add custom data types to atoms, bonds, residues, molecules

 RDKit
 A large number of descriptors (some overlap with CDK)
 Fragmentation using RECAP rules
 2D coordinate generation (like CDK) and depiction
 3D coordinate generation using geometry embedding
 Calculation of Cahn-Ingold-Prelog stereochemistry codes (R/S)
 Pharmacophore searching (like CDK)
 Calculation of shape similarity (based on volume overlap)
 Chemical reaction handling and transforms
 Atom pairs and topological torsions fingerprints
 Feature maps and feature-map vectors
 Machine-learning algorithms

 * Where the term "like" is used, it indicates that the implementation details differ.



data. For example, the CML project has defined a standardised XML format for chemical data [4], with successive releases refining and extending the original standard. The OpenSMILES effort [5] has attempted to resolve ambiguities in the published SMILES definition [6] to create a standard. While these efforts deserve support, they face inevitable problems achieving consensus and they require changes to existing software to support the standard. The large number of chemical file formats supported by OpenBabel (currently over 80) illustrates both the potential of achieving a standard as well as the difficulties.

An alternative is interoperability at the API (application programming interface) level. This has the advantage that it does not require any changes to existing software. However, there are at least three barriers to overcome: the need for a programming language that can access all the toolkits simultaneously, the difficulty of exchanging chemical models between different toolkits, and differences in the API for core cheminformatics tasks shared by the toolkits.

Here we describe Cinfony, a Python module that overcomes these barriers to provide interoperability at the API level. Cinfony allows access to OpenBabel, the CDK, and the RDKit through a common interface, and uses a simple yet robust method to pass chemical models between toolkits. Pybel, one of the components of Cinfony, has been described previously [7]. It provides access to OpenBabel from standard Python. In this work, we show that the API developed for Pybel may be considered a generic API for accessing any cheminformatics toolkit. We describe the design and implementation of the Cinfony API for OpenBabel, the RDKit and the CDK. Next, we show how Cinfony simplifies the process of accessing the toolkits and how it can be used in practice to combine the power of the three Open Source toolkits. Finally, we discuss performance and some results from comparisons of the toolkits.

Implementation
Common Application Programming Interface
Cinfony presents the same interface to three cheminformatics toolkits, OpenBabel, the CDK and the RDKit. These are available through three separate modules: obabel, cdk and rdkit. The API is designed to make it easy to carry out many of the common tasks in cheminformatics, and covers the core functionality shared by all of the toolkits. Table 2 gives an overview of the API. The complete API is available here (see Additional file 1).

The main class containing chemical information is the Molecule class. Rather than create a new chemical model, the Molecule class is a light wrapper around the molecule object in the underlying library, for example, around OBMol in the case of OpenBabel. Attribute values such as the molecular weight are calculated dynamically by querying the underlying molecule. This ensures that if the underlying OBMol, for example, is altered, the attribute values returned will still be correct. The actual underlying object (an OpenBabel OBMol, a CDK Molecule, or an RDKit Mol) can be accessed directly at any point.

The Molecule class also contains several methods that act on molecules such as methods for calculating fingerprints, adding hydrogens, and calculating descriptor values. This makes it easy to access these methods, and also brings them to the attention of the user. In the underlying toolkit these methods may not be present as part of the molecule class, and in fact, they can be difficult to find in the toolkit's API. For example, the Cinfony method Molecule.addh() adds explicit hydrogens to the molecule. Although the OBMol of OpenBabel has a corresponding method, OBMol.AddHydrogens(), the RDKit uses a global method, AddHs(Mol), while the CDK requires the user to instantiate a HydrogenAdder object, which can then be used to add hydrogens.

The Molecule methods described in the original Pybel API [7] have been extended to handle hydrogen addition and removal, structure diagram generation, assignment of 3D geometry to 0D structures and geometry optimisation using forcefields. Both the CDK and the RDKit are capable of 2D coordinate generation and 2D depiction. However, since OpenBabel currently has neither of these capabilities, a fourth toolkit, OASA, is used by Pybel for this purpose. OASA is a lightweight cheminformatics toolkit implemented in Python [8].

A new development in the latest version of OpenBabel is 3D coordinate generation and geometry optimisation using one of a number of forcefields. Since these methods are also available in the RDKit, and are under development in the CDK, two additional methods have been added to the Cinfony Molecule: make3D(), for 3D coordinate generation, and localopt(), for geometry optimisation. Particularly in the case of OpenBabel, these new methods simplify the process of generating 3D coordinates. Compare a single call to make3D() in Cinfony with the following OpenBabel code:

structuregenerator = openbabel.OBOp.FindType('Gen3D')
structuregenerator.Do(mol)
mol.AddHydrogens()
Table 2: An overview of the Cinfony API.

 Class name       Purpose

 Molecule         Wraps a molecule instance of the underlying toolkit and provides access to methods that act on molecules
 Atom             Wraps an atom instance of the underlying toolkit
 MoleculeData     Provides dictionary-like access to the information contained in the tag fields in SDF and MOL2 files
 Outputfile       Handles multimolecule output file formats
 Smarts           Wraps the SMARTS functionality of the toolkit in an analogous way to the Python 're' module for regular expression matching
 Fingerprint      Simplifies Tanimoto calculation of binary fingerprints

 Function name
 readfile         Return an iterator over Molecules in a file
 readstring       Return a Molecule

 Variable name
 descs            A list of descriptor IDs
 forcefields      A list of forcefield IDs
 fps              A list of fingerprint IDs
 informats        A list of input format IDs
 outformats       A list of output format IDs
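
Table 2 lists a Fingerprint class that simplifies Tanimoto calculation for binary fingerprints. As a rough illustration of that idea (not Cinfony's or Pybel's actual implementation), the sketch below overloads the | operator to return the Tanimoto coefficient, that is, the number of bits set in both fingerprints divided by the number of bits set in either. The class name SketchFingerprint and the set-of-bit-positions representation are assumptions made for this example only.

```python
class SketchFingerprint(object):
    """Toy binary fingerprint stored as a set of 'on' bit positions.

    Hypothetical class for illustration; not part of Cinfony or Pybel.
    """

    def __init__(self, bits):
        self.bits = frozenset(bits)

    def __or__(self, other):
        """Return the Tanimoto coefficient of two fingerprints."""
        union = len(self.bits | other.bits)
        if union == 0:
            # Convention chosen for this sketch: two empty fingerprints score 0.
            return 0.0
        common = len(self.bits & other.bits)
        return common / float(union)


fp_a = SketchFingerprint([1, 5, 9, 12])
fp_b = SketchFingerprint([1, 5, 10, 12, 40])

similarity = fp_a | fp_b          # 3 shared bits out of 6 distinct bits
is_similar = similarity >= 0.7    # same style of threshold test as a similarity filter
```

Overloading an operator keeps the calling code short, at the cost of giving | a meaning that differs from its usual bitwise sense; a named method would be the more explicit alternative.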







ff = openbabel.OBForceField.FindType("MMFF94")
ff.Setup(mol)
ff.SteepestDescent(50)
ff.GetCoordinates(mol)

The Cinfony API is identical for all of the toolkits. However, the values returned by particular API calls are not necessarily standardised across toolkits. This Cinfony design decision is in agreement with the Principle of Least Surprise [9]; when the user accesses the underlying toolkit directly, they will get the same result as found when using Cinfony. This design decision places the responsibility on the user to become familiar with differences in how the toolkits behave. For example, all of the toolkits allow the calculation of path-based fingerprints. These encode all paths in the molecular graph up to a path length of P into a binary vector of length V, but the default values for V and P are different for each toolkit: 1024 and 7 for OpenBabel, 1024 and 8 for the CDK, and 2048 and 7 for RDKit. Although it is possible to alter these parameters for the CDK and the RDKit and so standardise V and P to 1024 and 7 for all of the toolkits, it is reasonable to assume that the developers of each package have chosen sensible defaults. In addition, the implementation details of each of the fingerprinters would still be different; for example, the RDKit sets four bits when hashing each molecular path, the others set one; OpenBabel does not set any bits for the one-atom fragments, N, C and O.

Interoperability
The ability to transfer chemical models between toolkits is essential to the goal of interoperability. However, the internal representation of a molecule is specific to a particular toolkit. For example, as well as the connection table and coordinates (if present), it may include derived data relating to aromaticity, the number of implicit hydrogens on an atom, or stereochemical configuration. Fortunately, the problem of transfer and storage of chemical information has already been solved by the development of molecular file formats, of which over 80 are now supported by OpenBabel. Specifically, the MDL MOL file format [10] and the SMILES format [5,6] are shared by all three toolkits, and are used by Cinfony to exchange information on molecules with 2D or 3D coordinates (MOL file format), and no coordinates (SMILES format), respectively.

By using existing file formats rather than trying to interconvert the internal models themselves, Cinfony takes advantage of the existing input/output code of each toolkit which is well-tested and mature. In addition, the translation process is transparent to the user. However, the user should be aware of known limitations of particular readers or writers. For example, the SMILES parser in CDK 1.0.3 ignores atom-based stereochemistry and thus that information is lost if a 0D rdkit or obabel Molecule with atom-based stereochemistry is converted to a cdk Molecule.

Cinfony Molecules are interconverted using the Molecule() constructor. For example, if obabelmol is an obabel Molecule, then the corresponding rdkit Molecule can be constructed using rdkit.Molecule(obabelmol). This mechanism can also be used to interface Cinfony to other cheminformatics toolkits. The only requirements are that the object passed to the Molecule() constructor needs to have a _cinfony attribute set to True, and an _exchange attribute containing a tuple (0, SMILES string) or (1, MOL file) depending on whether the molecule is 0D or not.

Implementation
The Python scripting language has two main implementations. The most widely used implementation is the original reference implementation of Python in C, referred to as CPython when necessary to distinguish it from other implementations. The next most widely used implementation is Jython, an implementation of Python in Java. Although most users of Python use CPython, Jython scripts have the advantage of being able to access Java libraries natively. They can also be compiled into Java classes to be used from Java programs. Jython scripts are also useful in contexts where Java is required but it is more convenient to work in Python; for example, to implement a Java web servlet or a node in a Java workflow environment such as KNIME [11].

As discussed earlier, one of the barriers to interoperability is the requirement for a programming language that can simultaneously access more than one of the toolkits. From CPython it is possible to use Cinfony modules to connect to OpenBabel (pybel), the CDK (cdkjpype) and the RDKit (rdkit). From Jython, there are modules for OpenBabel (jybel) and the CDK (cdkjython). Convenience modules obabel and cdk are provided that automatically import the appropriate OpenBabel or CDK module depending on the Python implementation. The relationship between these Cinfony modules and the underlying cheminformatics libraries is summarised in Figure 1.

pybel and jybel
OpenBabel provides SWIG [12] bindings for both CPython and Java (among other languages). pybel is a wrapper around the CPython bindings, and has previously been described in detail [7]. jybel is an implementation of the Cinfony API that allows the user to access OpenBabel from Jython using the Java bindings. Despite the fact that





Figure 1: Relationship of Cinfony modules to Open Source toolkits. Python modules are accessible from CPython (green), Jython (pale blue), or both (striped green and pale blue). Java libraries are indicated by dark blue, while C++ libraries are yellow.

rdkit
Support for Python scripting has been part of the design of the RDKit from the start. The Python bindings in RDKit were created using Boost.Python [14], a framework for interfacing Python and C++. The Cinfony module rdkit uses these bindings to implement its API. It is currently not possible to access RDKit from Jython. RDKit has only preliminary support for Java bindings; when these are complete, a corresponding module will be added to Cinfony.

Dependency handling
A fully-featured installation of Cinfony relies on a large number of open source libraries. In particular, the 2D depiction capabilities introduce dependencies on several graphics libraries which may be problematic to install on a particular platform (Cairo and its Python bindings, Python Imaging Library, AGG and the Python wrapper AggDraw). With this in mind, Cinfony treats all dependencies as optional and only raises an Exception if the user calls a method or imports a module that requires a missing dependency.

jybel is used from a Java implementation of Python, and          For example, the Python Imaging Library (PIL) is required
accesses a C++ library through the Java Native Interface         for displaying a 2D depiction on the screen. If all of the
(JNI), the jybel code differs from pybel in very few respects.   components of cinfony are installed except for PIL, Cin-
In Jython, it is not possible to iterate directly over the       fony works perfectly except that an Exception is raised if
wrapped STL vectors used by OpenBabel as their Java              the Molecule.draw() method is called with show = True
SWIG bindings do not implement the Iterable interface.           (the default). The image can however be written to a file
Also, the current Jython implementation is 2.2 and does          without problems (show = False, filename =
not support generator expressions, which were introduced         "image.png"). Similarly, if a user is only interested in
in Python 2.4. Although both C++ and Python have the             using the CDK and the RDKit, it is not necessary to install
concept of a global function or variable, this is not the        OpenBabel.
case in Java. SWIG places such functions, and get/set
methods for accessing the variables, in a special class          Full installation instructions for Windows, MacOSX and
named openbabel. Global constants are placed in another          Linux are available from the Cinfony website. It should be
class called openbabelConstants. A convenience module,           noted that for Windows users, there is no need to compile
obabel, is provided which automatically imports the              or search for missing libraries as the dependencies are
appropriate module depending on the Python implemen-             included as binaries in the Cinfony distribution.
tation.
                                                                 Results
cdkjpype and cdkjython                                           Cinfony API
Since Jython runs on top of the Java Virtual Machine             The original Pybel API was designed to make it easy to use
(JVM), it can access Java libraries such as the CDK              OpenBabel to perform the most common tasks in chem-
natively. To access Java libraries from CPython, the             informatics and to do so using idiomatic Python. Subse-
Python library JPype [13] is needed. This starts an instance     quently, we realised that the resulting API could be
of the JVM and uses the JNI to communicate back and              considered a generic API for wrapping the core function-
forth. Overall, the differences between the two wrappers         ality of any cheminformatics toolkit. Cinfony implements
are minor. Jython and JPype differ in the syntax used to         an extended version of the original Pybel API for the CDK
handle Java exceptions. Also, JPype returns unicode              and the RDKit, as well as OpenBabel. While the original
strings from the CDK and these need to be converted to           Pybel was restricted to CPython, Cinfony can also be used
regular strings (otherwise problems arise if they are passed     from Jython to access the CDK and OpenBabel.
to an OpenBabel method expecting a std::string). The
appropriate CDK wrapper, cdkjpype or cdkjython, will be          Cinfony helps cheminformaticians avoid the steep learn-
imported if the user imports the convenience module cdk.         ing curve associated with starting to use a new toolkit.
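As a rough illustration of what a generic API over several toolkits means in
practice, the sketch below wraps two stand-in backends behind the same
readstring(format, string) call. It is not Cinfony code: RegexToolkit,
LoopToolkit and the heavy_atoms attribute are hypothetical names invented for
this example, the atom counting only handles simple unbracketed, non-aromatic
SMILES, and the listing is written for Python 3 rather than the Python 2 used
elsewhere in this paper.

```python
import re


class Molecule(object):
    """Toolkit-neutral wrapper; 'heavy_atoms' is a made-up attribute."""

    def __init__(self, smiles, heavy_atoms):
        self.smiles = smiles
        self.heavy_atoms = heavy_atoms


class RegexToolkit(object):
    """Hypothetical backend A: finds atom symbols with a regular expression."""

    _atom = re.compile(r"Cl|Br|[BCNOPSFI]")

    def readstring(self, fmt, text):
        if fmt != "smi":
            raise ValueError("unsupported format: %r" % fmt)
        return Molecule(text, len(self._atom.findall(text)))


class LoopToolkit(object):
    """Hypothetical backend B: same interface, different internals."""

    def readstring(self, fmt, text):
        if fmt != "smi":
            raise ValueError("unsupported format: %r" % fmt)
        count, i = 0, 0
        while i < len(text):
            if text[i:i + 2] in ("Cl", "Br"):
                # Two-letter element symbols count as one atom.
                count += 1
                i += 2
            else:
                if text[i] in "BCNOPSFI":
                    count += 1
                i += 1
        return Molecule(text, count)


# The calling code is identical whichever backend is used.
for toolkit in (RegexToolkit(), LoopToolkit()):
    mol = toolkit.readstring("smi", "CCC=O")
    print(type(toolkit).__name__, mol.heavy_atoms)
```

Because both backends accept the same arguments and return the same kind of
object, a script written against one can be pointed at the other by changing a
single name, which is the property the common API is meant to provide.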





With Cinfony, all of the core functionality of the toolkits can be accessed with
the same interface. For example, in Cinfony, a molecule can be created from a
SMILES string with:

    mol = toolkit.readstring("smi", SMILESstring)

RDKit

    mol = Chem.MolFromSmiles(SMILESstring)

OpenBabel

    mol = openbabel.OBMol()
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("smi")
    obconversion.ReadString(mol, SMILESstring)

CDK

    builder = cdk.DefaultChemObjectBuilder.getInstance()
    sp = cdk.smiles.SmilesParser(builder)
    mol = sp.parseSmiles(SMILESstring)

The RDKit was designed with Python scripting in mind, and of the three toolkits
is the most concise. On the other hand, OpenBabel uses a characteristically C++
approach. An empty molecule is created, and is passed to an OBConversion
instance as a container for the molecule read from the SMILES string. The
SmilesParser in the CDK requires an instance of an object implementing the
IChemObjectBuilder interface.

Another advantage of a common API is that a script written for one toolkit can
easily be modified to use another. As an example, here is a script that selects
molecules that are similar to a particular target molecule. This script is taken
from the original Pybel paper [7], but uses the CDK instead of OpenBabel and
will run equally well from Jython and CPython. The only differences compared to
the original script are that "pybel" has been replaced with "cdk", and the
import statement has been changed from "import pybel":

    from cinfony import cdk

    targetmol = cdk.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = cdk.Outputfile("sdf", "similarmols.sdf")
    for mol in cdk.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Alternatively, we could just have made a single change to the original script,
by replacing the import statement "import pybel" with "from cinfony import cdk
as pybel".

Using Cinfony to combine toolkits
Another goal of Cinfony is to make it easy to combine toolkits in the same
script. This allows the user to exploit the complementary capabilities of
different toolkits (Table 1). For example, let's suppose the user wants to (1)
convert a SMILES string to 3D coordinates with OpenBabel, then (2) create a 2D
depiction of that molecule with the RDKit, next (3) calculate descriptors with
the CDK, and finally (4) write out an SDF file containing the descriptor values
and the 3D coordinates. The full Python script is only seven lines long:

    from cinfony import rdkit, cdk, obabel

    mol = obabel.readstring("smi", "CCC=O")
    mol.make3D()
    rdkit.Molecule(mol).draw(show = False, filename = "aldehyde.png")
    descs = cdk.Molecule(mol).calcdesc()
    mol.data.update(descs)
    mol.write("sdf", filename = "aldehyde.sdf")

For cheminformaticians interested in developing QSAR or QSPR models, Cinfony can
be used to simultaneously calculate descriptors from the RDKit, the CDK and
OpenBabel. For example, the following script reads a multiline input file, with
each line consisting of a SMILES string followed by a property value. For each
molecule, it calculates all of the OpenBabel, RDKit and CDK descriptors (except
for CDK's CPSA) and writes out the results as a tab-separated file suitable for
reading with the statistical package R [15]. Note that in this example script,
if descriptors share the same name only one is retained. This is the case for
the TPSA descriptor in OpenBabel, which is replaced by the RDKit's TPSA
descriptor.

    import string
    from cinfony import obabel, cdk, rdkit

    # Read in SMILES strings and observed property values
    smiles, propvals = [], []
    for line in open("data.txt"):
        broken = line.rstrip().split()
        smiles.append(broken[0])
        propvals.append(float(broken[1]))
    mols = [obabel.readstring("smi", smile) for smile in smiles]

    # Calculate descriptor values using OpenBabel,
    # the CDK (apart from 'CPSA') and the RDKit
    cdkdescs = [x for x in cdk.descs if x != 'CPSA']
    descs = []
    for mol in mols:
        d = mol.calcdesc()
        d.update(cdk.Molecule(mol).calcdesc(cdkdescs))
        d.update(rdkit.Molecule(mol).calcdesc())
        descs.append(d)

    # Write a file suitable for 'read.table' in R
    outputfile = open("inputforR.txt", "w")
    descnames = sorted(descs[0].keys(), key = string.lower)
    print >> outputfile, "\t".join(["Property"] + descnames)
    for smile, propval, desc in zip(smiles, propvals, descs):
        descvals = [str(desc[descname]) for descname in descnames]
        print >> outputfile, "\t".join([smile, str(propval)] + descvals)
    outputfile.close()

Performance
Accessing cheminformatics libraries using Cinfony allows the user to rapidly
develop scripts that manipulate chemical information. However, there is a small
price to be paid. Firstly, there is the cost of moving objects across the
interface between Python and the cheminformatics libraries. Secondly, the
additional code required by Cinfony to implement a standard API may slow
performance further.

To assess the performance penalty for accessing cheminformatics toolkits using
Cinfony rather than directly in the native language, we looked at two simple
test cases: (1) iterating over an SDF file containing 25419 molecules, (2)
iterating and printing out the molecular weight of each of the molecules. The
SDF file used was 3_p0.0.sdf, the first portion of the drug-like subset of the
ZINC 7.00 dataset [16]. The Cinfony scripts, Java and C++ source code are
available as Additional file 2. The results are shown in Table 3.

While accessing the CDK using Jython is almost as fast as a pure Java
implementation, there is a considerable overhead associated with using JPype to
access the CDK from CPython (89% slower for the second test case). This overhead
is due to passing objects between the JVM and CPython. For OpenBabel, there is
little performance cost associated with accessing OpenBabel from either
implementation of Python, although the jybel scripts are somewhat slower than
pybel scripts. A small portion of this speed difference can be attributed to a
slower startup (about 1.6 seconds for jybel, compared to 0.8 seconds for pybel).
Finally, from the RDKit results in Table 3, it is clear that using Boost.Python
to wrap a C++ library is more efficient than using SWIG. The difference in run
times between the C++ and Python implementations is negligible.

In practice, the performance of a particular Cinfony script will depend on the
extent to which information is passed





Table 3: Performance of Cinfony modules compared to a native Java or C++ implementation.

                    Iterate over SDF            Iterate and calculate molecular weight
 Implementation     Time (s)    Normalised      Time (s)    Normalised

 CDK
   Native Java        21.2         1.00           36.8         1.00
   cdkjython          23.1         1.09           41.6         1.13
   cdkjpype           33.0         1.57           69.5         1.89

 OpenBabel
   Native C++         31.9         1.00           43.0         1.00
   pybel              34.1         1.07           45.1         1.05
   jybel              38.0         1.19           49.6         1.15

 RDKit
   Native C++         99.7         1.00          100.7         1.00
   rdkit              99.9         1.00          101.0         1.00

 The times reported are wallclock times from the best of three runs on a dual-core Intel Pentium 4 3.2 GHz machine with 1 GB RAM.
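The "Normalised" columns in Table 3 appear to be each run time divided by the
run time of the corresponding native implementation. That reading is an
assumption rather than something stated in the table, but it is consistent with
the 89% overhead quoted for cdkjpype. A short check, written for Python 3 and
using only values from the second test case in the table (the function names
are chosen for this example):

```python
# Run times (s) from Table 3, second test case:
# iterate over the SDF file and calculate molecular weight.
native_java = 36.8
timings = {"cdkjython": 41.6, "cdkjpype": 69.5}


def normalised(time_s, native_s):
    """Run time relative to the native implementation (assumed definition)."""
    return time_s / native_s


def percent_slower(time_s, native_s):
    """Overhead relative to the native implementation, as a percentage."""
    return 100.0 * (time_s - native_s) / native_s


for name, t in sorted(timings.items()):
    print("%-10s normalised %.2f, %.0f%% slower than native Java"
          % (name, normalised(t, native_java), percent_slower(t, native_java)))
```

For cdkjpype this gives a normalised time of 1.89, that is, about 89% slower
than native Java; for cdkjython it gives 1.13, or about 13% slower.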



back and forth between Python and the underlying Java or C++ library. Where most
of the time is spent on computation in the underlying library, the speed
difference between a native implementation and one using Cinfony is expected to
be small.

Comparison of toolkits
Cinfony makes it easy to compare the results obtained by different toolkits for
the same operations. This can be useful in identifying bugs, applying a test
suite, or finding the strengths and weaknesses of particular implementations.
For example, where different toolkits calculate the same descriptors, if the
calculated values are not highly correlated it may indicate a bug in one or the
other. Earlier, we mentioned that a difference in the treatment of implicit
hydrogens causes different toolkits to give different values for molecular
weight unless hydrogens are explicitly added. Ensuring that a particular result
is in agreement with that obtained by another toolkit can act as a sanity check
in such instances to avoid errors.

When carrying out the same operation with several toolkits, it is often
convenient to iterate over the toolkits in an outer loop:

    from cinfony import obabel, rdkit, cdk

    for toolkit in [obabel, rdkit, cdk]:
        print toolkit.readstring("smi", "CCC").molwt

As an example of how such comparisons can be used to identify bugs in toolkits,
let us consider depiction. As a dataset, we randomly chose 100 molecules from
PubChem [17], with subsequent filtering to remove multicomponent molecules. For
each molecule, PubChem provides an SDF file containing coordinates for a 2D
depiction, as well as the depiction itself as a PNG file. PubChem uses the
CACTVS toolkit [18] to generate the 2D coordinates as well as the corresponding
depiction. Using a script similar to the following, we used Cinfony to generate
2D depictions using OASA (the depiction library used by pybel), the CDK and a
development version of RDKit that all use the same 2D coordinates taken from the
SDF file:

    from cinfony import pybel, rdkit

    for toolkit in [rdkit, pybel]:
        name = toolkit.__name__
        for mol in toolkit.readfile("sdf", "dataset.sdf"):
            mol.draw(filename = "%s_%s.png" % (mol.title, name),
                     show = False,
                     usecoords = True)

When the resulting images were compared for the PubChem entry CID7250053, an
error was found in the depiction of the stereochemistry of an isopropyl group
(Figure 2). Since the error only occurred in certain cases, it had not been
previously noticed and would have been difficult to identify without such a
comparative study. Once reported, the problem was quickly solved and the
subsequent RDKit release depicted the stereochemistry correctly. A comparison of
depictions by commercial toolkits





and depictions generated by Cinfony is available here (see Additional file 3).

Figure 2
Comparison of depictions of PubChem CID7250053 using different toolkits. The
depiction using the development version of RDKit showed incorrect
stereochemistry for the isopropyl substituent of the thiazole ring.

Conclusion
Cinfony makes it easy to combine complementary features of the three main Open
Source cheminformatics toolkits. By presenting a standard simplified API, the
learning curve associated with starting to use a new toolkit is greatly reduced,
thus encouraging users of one toolkit to investigate the potential of others.

Cinfony is freely available from the Cinfony website [19], both as Python source
code and as a Windows distribution containing dependencies. Installation
instructions are provided for MacOSX, Linux and Windows.

Availability and requirements
Project name: Cinfony

Project home page: http://cinfony.googlecode.com

Operating system(s): Platform independent

Programming language: Python, Jython

Other requirements: OpenBabel, CDK, RDKit, Java, OASA, JPype, Python Imaging
Library

License: BSD

Any restrictions to use by non-academics: None

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
NMOB conceived and developed Cinfony. GRH is the lead developer of OpenBabel and
created the Python and Java SWIG bindings. All authors read and approved the
final manuscript.

Additional material

Additional file 1
Miniwebsite API. A mini-website of the Cinfony API documentation.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S1.zip]

Additional file 2
Timing Code. A zip file containing Python, Java and C++ code used for run time
comparisons for two test cases.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S2.zip]

Additional file 3
Miniwebsite Depictions. A mini-website showing a comparison of the depictions
generated by several cheminformatics toolkits.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S3.zip]

Acknowledgements
Cinfony would not be possible without the work of many Open Source projects. In
particular, we thank several developers who responded quickly to bug reports or
queries: Beda Kosata (OASA), Greg Landrum (RDKit), Tim Vandermeersch
(OpenBabel), Steve Ménard (JPype). Thanks also to Gilbert Mueller and Chris
Morley for feedback on installing Cinfony. NMOB thanks Google Code for providing
free web hosting and development tools for Cinfony. We thank the anonymous
reviewers for several useful suggestions.

References
1.    OpenBabel v.2.2.0 [http://openbabel.org]
2.    Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent
      Developments of the Chemistry Development Kit (CDK) – An Open-Source Java
      Library for Chemo- and Bioinformatics. Curr Pharm Des 2006, 12:2110-2120.
3.    Landrum G: RDKit. [http://www.rdkit.org].
4.    Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide Web. 1.
      Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.






5.    Apodaca R, O'Boyle N, Dalke A, Van Drie J, Ertl P, Hutchison G, James CA,
      Landrum G, Morley C, Willighagen E, De Winter H: OpenSMILES.
      [http://www.opensmiles.org].
6.    Daylight Chemical Information Systems Manual
      [http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html]
7.    O'Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the
      OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
8.    Kosata B: OASA. [http://bkchem.zirael.org/oasa_en.html].
9.    Raymond ES: The Art of UNIX Programming. Reading, MA: Addison-Wesley; 2003.
      [http://www.catb.org/~esr/writings/taoup/index.html]
10.   Symyx CTfile formats
      [http://www.mdli.com/downloads/public/ctfile/ctfile.jsp]
11.   KNIME – Konstanz Information Miner [http://knime.org]
12.   SWIG v.1.3.36 [http://www.swig.org]
13.   Ménard S: JPype. [http://jpype.sf.net].
14.   Boost.Python [http://www.boost.org/libs/python/doc/]
15.   R development core team: R: A language and environment for statistical
      computing. [http://www.R-project.org].
16.   Irwin JJ, Shoichet BK: ZINC – A Free Database of Commercially Available
      Compounds for Virtual Screening. J Chem Inf Model 2005, 45:177-182.
17.   PubChem [http://pubchem.ncbi.nlm.nih.gov/]
18.   CACTVS Chemoinformatics Toolkit: Xemistry GmbH: Lahntal, Germany.
19.   O'Boyle NM: Cinfony. [http://cinfony.googlecode.com].




O’Boyle et al. Journal of Cheminformatics 2011, 3:33
   http://www.jcheminf.com/content/3/1/33




    SOFTWARE                                                                                                                                     Open Access

   Open Babel: An open chemical toolbox
   Noel M O’Boyle1, Michael Banck2, Craig A James3, Chris Morley4, Tim Vandermeersch4 and Geoffrey R Hutchison5*


     Abstract
     Background: A frequent problem in computational modeling is the interconversion of chemical structures
     between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and
     de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing
     problem due to the multitude of different application areas for chemistry data, differences in the data stored by
     different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-
     neutral formats.
     Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many
     languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a
     wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics
     algorithms, from partial charge assignment and aromaticity detection, to bond order perception and
     canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and
     outline a variety of uses both in terms of software products and scientific research, including applications far
     beyond simple format interconversion.
     Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it
     provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and
     substructure and similarity searching. For developers, it can be used as a programming library to handle chemical
     data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely
     available under an open-source license from http://openbabel.org.


   Introduction
   The history of chemical informatics has included a huge variety of textual and computer
   representations of molecular data. Such representations focus on specific atomic or molecular
   information and may not attempt to store all possible chemical data. For example, line notations
   like Daylight SMILES [1] do not offer coordinate information, while crystallographic or quantum
   mechanical formats frequently do not store chemical bonding data. Hydrogen atoms are frequently
   omitted from x-ray crystallography due to the difficulty in establishing coordinates, and are
   often ignored by some file formats as the “implicit valence” of heavy atoms that indicates their
   presence. Other types of representations require specification of atom types on the basis of a
   specific valence bond model, inclusion of computed partial charges, indication of biomolecular
   residues, or multiple conformations.
     While attempts have been made to provide a standard format for storing chemical data, including
   most notably the development of Chemical Markup Language (CML) [2-6], an XML dialect, such formats
   have not yet achieved widespread use. Consequently, a frequent problem in computational modeling
   is the interconversion of molecular structures between different formats, a process that involves
   extraction and interpretation of their chemical data and semantics.
     We outline for the first time, the development and use of the Open Babel project, a full-featured
   open chemical toolbox, designed to “speak” the many different representations of chemical data. It
   allows anyone to search, convert, analyze, or store data from molecular modeling, chemistry,
   solid-state materials, biochemistry, or related areas. It provides both ready-to-use programs as
   well as a complete, extensible programmer’s toolkit for developing cheminformatics software. It
   can handle reading,

   * Correspondence: geoffh@pitt.edu
   5 University of Pittsburgh, Department of Chemistry, 219 Parkman Avenue, Pittsburgh, PA 15217, USA
   Full list of author information is available at the end of the article

   © 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed
   under the terms of the Creative Commons Attribution License
   (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
   reproduction in any medium, provided the original work is properly cited.




J. Cheminf. 2011, 3, 33.




writing, and interconverting over 110 chemical file formats, supports filtering and searching
molecule files using Daylight SMARTS pattern matching [7] and other methods, and provides
extensible fingerprinting and molecular mechanics frameworks. We will discuss the frameworks for
file format interconversion, fingerprinting, fast molecular searching, bond perception and atom
typing, canonical numbering of molecular structures and fragments, molecular mechanics force
fields, and the extensible interfaces provided by the software library to enable further chemistry
software development.
  Open Babel has its origin in a version of OELib released as open-source software by OpenEye
Scientific under the GPL (GNU General Public License). In 2001, OpenEye decided to rewrite OELib
in-house as the proprietary OEChem library, so the existing code from OELib was spun out into the
new Open Babel project. Since 2001, Open Babel has been developed and substantially extended as an
international collaborative project using an open-source development model [8]. It has over
160,000 downloads, over 400 citations [9], is used by over 40 software projects [10], and is
freely available from the Open Babel website [11].

Features
File Format Support
With the release of Open Babel 2.3, Open Babel supports 111 chemical file formats in total. It can
read 82 formats and write 85 formats. These encompass common formats used in cheminformatics
(SMILES, InChI, MOL, MOL2), input and output files from a variety of computational chemistry
packages (GAMESS, Gaussian, MOPAC), crystallographic file formats (CIF, ShelX), reaction formats
(MDL RXN), file formats used by molecular dynamics and docking packages (AutoDock, Amber), formats
used by 2D drawing packages (ChemDraw), 3D viewers (Chem3D, Molden) and chemical kinetics and
thermodynamics (ChemKin, Thermo). Formats are implemented as “plugins” in Open Babel, which makes
it easy for users to contribute new file formats (see Extensible Interface below). Depending on
the format, other data is extracted by Open Babel in addition to the molecular structure; for
example, vibrational frequencies are extracted from computational chemistry log files, unit cell
information is extracted from CIF files, and property fields are read from SDF files.
   A number of “utility” file formats are also defined; these are not strictly speaking a way of
storing the molecular structure, but rather present certain functionality through the same
interface as the regular file formats. For example, the report format is a write-only utility
format [12] that presents a summary of the molecular structure of a molecule; the fingerprint
format [13] and fastsearch format [14] are used for similarity and substructure searching (see
below); the MolPrint2D and Multilevel Neighborhoods of Atoms formats calculate circular
fingerprints defined by Bender et al. [15,16] and Filimonov et al. [17,18] respectively.
  Each format can have multiple options to control either reading or writing a particular format.
For example, the InChI format has 12 options including an option “K” to generate an InChIKey,
“T <param>” to truncate the InChI depending on a supplied parameter and “w” to ignore certain
InChI warnings. The available options are listed in the documentation, are shown in the Graphical
User Interface (GUI) as checkboxes or textboxes, and can be listed at the command-line. In fact,
all three are generated from the same source; a documentation string in the C++ code.

Fingerprints and Fast Searching
Databases are widely used to store chemical information especially in the pharmaceutical industry.
A key requirement of such a database is the ability to index chemical structures so that they can
be quickly retrieved given a query substructure. Open Babel provides this functionality using a
path-based fingerprint. This fingerprint, referred to as FP2 in Open Babel, identifies all linear
and ring substructures in the molecule of lengths 1 to 7 (excluding the 1-atom substructures C and
N) and maps them onto a bit-string of length 1024 using a hash function. If a query molecule is a
substructure of a target molecule, then all of the bits set in the query molecule will also be set
in the target molecule. The fingerprints for two molecules can also be used to calculate
structural similarity using the Tanimoto coefficient, the number of bits in common divided by the
number of bits in the union of the bits set.
   Clearly, repeated searching of the same set of molecules will involve repeated use of the same
set of fingerprints. To avoid the need to recalculate the fingerprints for a particular
multi-molecule file (such as an SDF file), Open Babel provides a fastindex format that solely
stores a fingerprint along with an index into the original file. This index leads to a rapid
increase in the speed of searching for matches to a query - datasets with several million
molecules are easily searched interactively. In this way, a multi-molecule file may be used as a
lightweight alternative to a chemical database system.

Bond Perception and Atom Typing
As mentioned above, many chemical file formats offer representations of molecular data solely as
lists of atoms. For example, most quantum chemical software packages and most crystallographic
file formats do not offer definitions of bonding. A similar situation occurs in the case of the
Protein Data Bank (PDB) format; while standardized [19] files contain connectivity







information, non-standard files exist that often do not provide full connectivity information.
Consequently, Open Babel features methods to determine bond connectivity, bond order perception,
aromaticity determination, and atom typing.
   Bond connectivity is determined by the frequently used algorithm of detecting atoms closer than
the sum of their covalent radii, with a slight tolerance (0.45 Å) to allow for longer than typical
bonds. To handle disorder in crystallographic data (e.g., PDB or CIF files), atoms closer than
0.63 Å are not bonded. A further filtering pass is made to ensure standard bond valency is
maintained; each element has a maximum number of bonds, and if this is exceeded then the longest
bonds to an atom are successively removed until the valence rule is fulfilled.
   After bond connectivity is determined, if needed or requested by the user, bond order perception
is performed on the basis of bond angles and geometries. The method is similar to that proposed by
Roger Sayle [20] and uses the average bond angle around an un-typed atom to determine sp and sp2
hybridized centers. 5-membered and 6-membered rings are checked for planarity to estimate
aromaticity. Finally, atoms marked as unsaturated are checked for an unsaturated neighbor to give
a double or triple bond. After this initial atom typing, known functional groups are matched,
followed by aromatic rings, followed by remaining unsatisfied bonds based on a set of heuristics
for short bonds, atomic electronegativity, and ring membership.
   Atom typing is performed by “lazy evaluation,” matching atoms against SMARTS patterns to
determine hybridization, implicit valence, and external atom types. Atom type perception may be
triggered by adding hydrogens (which requires determination of implicit and explicit valence),
exporting to a file format that requires atom types, or as requested by the user. To minimize the
amount of typing required, when importing from a format with atom types specified, a lookup table
is used to translate between equivalent types.
   An important part of atom typing is aromaticity detection and assignment of Kekulé bond orders
(kekulization). In Open Babel, a central aromaticity model is used, largely matching the commonly
used Daylight SMILES representation [1], but with added support for aromatic phosphorus and
selenium. Potential aromatic atoms and bonds are flagged on the basis of membership in a ring
system possibly containing 4n+2 π electrons. Aromaticity is established only if a well-defined
valence bond Kekulé pattern can be determined. To do this, atoms are added to a ring system and
checked against the 4n+2 π electron configuration, gradually increasing the size to establish the
largest possible connected aromatic ring system. Once this ring system is determined, an
exhaustive search is performed to assign single and double bonds to satisfy all valences in a
Kekulé form. Since this process is exponential in complexity, the algorithm will terminate if more
than 30 levels of recursion or 15 seconds are exceeded (which may occur in the case of large fused
ring systems such as carbon nanotubes).

Canonical Representation of Molecules
In general, for any particular molecular structure and file format, there are a large number of
possible ways the structure could be stored; for example, there are N! ways of ordering the atoms
in an MOL file. While each of the orderings encodes exactly the same information, it can be useful
to define a canonical numbering of the atoms of a molecule and use this to derive a canonical
representation of a molecule for a particular file format. For a zero-dimensional file format
without coordinates, such as SMILES, the canonical representation could be used to index a
database, remove duplicates or search for matches.
   Open Babel implements a sophisticated canonicalization algorithm that can handle molecules or
molecular fragments. The atom symmetry classes are the initial graph invariants and encode
topological and chemical properties. A cooperative labeling procedure is used to investigate the
automorphic permutations to find the canonical code. Although the algorithm is similar to the
original Morgan canonical code [21], various improvements are implemented to improve performance.
Most notably, the algorithm implements heuristics from the popular nauty package [22,23]. Another
aspect handled by the canonical code is stereochemistry as different labelings can lead to
different parities. This is further complicated by the possibility of symmetry-equivalent
stereocenters and stereocenters whose configuration is interdependent. The full details will be
the subject of a separate publication.

Coordinate Generation in 2D and 3D
Open Babel, version 2.3, has support for 2D coordinate generation (Figure 1) through the donation
of code by Sergei Trepalin, based on the code used in the MCDL chemical structure editor [24-26].
The MCDL algorithm aims to lay out the molecular structure in 2D such that all bond lengths are
equal and all bond angles are close to 120°. The layout algorithm includes a small database of
around 150 templates to help lay out cages and large fragment cycles. To deal with the problem of
overlapping fragments, the algorithm includes an exhaustive search procedure that rotates around
acyclic bonds by 180°.
  Coordinate generation in 3D was introduced in Open Babel version 2.2, and improved in version
2.3, to enable







conversion from 0D formats such as SMILES to 3D formats such as SDF (Figure 1). The 3D structure
generator builds linear components from scratch following geometrical rules based on the
hybridization of the atoms. Single-conformer ring templates are used for ring systems. The
template matching algorithm iterates through the templates from largest to smallest searching for
matches. If a match is found, the algorithm continues but will not match any ring atoms previously
templated except in the case of a single overlap (the two ring systems of a spiro group) or an
overlap involving exactly two adjacent atoms (two fused ring systems). After an initial structure
is generated, the stereochemistry (cis/trans and tetrahedral) is corrected to match the input
structure. Finally, the energy of the structure is minimized using the MMFF94 forcefield [27-31]
and a low energy conformer found using a weighted rotor search.
  While the 3D structure builder produces reasonable conformations for molecules without rings or
with ring systems for which a template exists, the results may be poor for molecules with more
complex ring systems or organometallic species. Future work will be performed to compare the
results of Open Babel with other programs with respect to both speed and the quality of the
generated structures [32].

 Figure 1 Interconversion of 0D, 2D and 3D structures. The structures shown are of sertraline, a
 selective serotonin reuptake inhibitor (SSRI) used in the treatment of depression. A SMILES string
 for sertraline is shown at the top; this can be considered a 0D structure (only connectivity and
 stereochemical information). From this, Open Babel can generate a 2D structure (bottom left,
 depicted by Open Babel) or a 3D structure (bottom right, depicted by Avogadro), and all of these
 can be interconverted.

Stereochemistry
A recent focus of Open Babel development has been to ensure robust translation of stereochemical
information between file formats. This is particularly important when dealing with 0D formats as
these explicitly encode the perceived stereochemistry. Open Babel 2.3 includes classes to handle
cis/trans double bond stereochemistry, tetrahedral stereochemistry and square-planar
stereochemistry (this last is still under development), as well as perception routines for 2D and
3D geometries, and routines to query and alter the stereochemistry.
   The detection of stereogenic units starts with an analysis of the graph symmetry of the molecule
to identify the symmetry class of each atom. However, given that a complete symmetry analysis also
needs to take stereochemistry into account, this means that the overall stereochemistry can only
be found iteratively. At each iteration, the current atom symmetry classes are used to identify
stereogenic units. For example, a tetrahedral center is identified as chiral if it has four
neighbors with different symmetry classes (or three, in the case where a lone pair gives rise to
the tetrahedral shape).

Forcefields
Molecular mechanics functions are provided for use with small molecules. Typical applications
include energy evaluation or minimization, alone or as part of a larger workflow. The selection of
implemented force fields allows most molecular structures to be used and parameters to be assigned
automatically. The MMFF94(s) force field can be used for organic or drug-like molecules [27-31].
For molecules containing any element of the periodic table or complex geometry (i.e. not supported
by MMFF94), the UFF force field can be used instead [33]. Recently, code implementing the GAFF
force field [34,35] was also contributed and released as part of version 2.3. All of the
forcefields allow the application of constraints on particular atom positions, or particular
distances.
   Several conformer searching methods have been implemented using the forcefields, all based on
the “torsion-driving” approach. This approach involves setting torsion angles from a set of
predefined allowed values for a particular rotatable bond. The most thorough search method
implemented is a systematic search method, which iterates over all of the allowed torsion angles
for each rotatable bond in the molecule and retains the conformer with the lowest energy. Since a
systematic search may not be feasible for a molecule with multiple rotatable bonds, a number of
stochastic search methods are also available: the random search method, which tries random
settings for the torsion angles (from the predefined allowed values), and a weighted rotor search,
a stochastic search method that converges on a low energy conformer by weighting particular
torsion angles based on the relative energy of the generated conformer. With Open Babel 2.3,
conformer search based on a genetic algorithm is also available which allows the application of
filters (e.g. a diversity filter) and different scoring functions. This latter method can be used
to generate a library of diverse conformers, or like the other methods to seek a low energy
conformer [36].
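
The systematic and random torsion-driving searches described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the allowed torsion values, the
preferred angles and the toy energy function are all hypothetical stand-ins, no real force field
(MMFF94, UFF or GAFF) is evaluated, and none of the names below belong to the Open Babel API.

```python
import itertools
import math
import random

# Hypothetical allowed torsion values (degrees), one tuple per rotatable bond.
ALLOWED = [
    (60.0, 180.0, 300.0),
    (0.0, 90.0, 180.0, 270.0),
    (60.0, 180.0, 300.0),
]

# Hypothetical preferred angle for each bond, used only by the toy energy below.
PREFERRED = (180.0, 90.0, 300.0)


def toy_energy(torsions):
    """Toy stand-in for a force field energy (not MMFF94, UFF or GAFF)."""
    # One term per bond that is lowest at that bond's preferred angle.
    energy = sum(
        1.0 - math.cos(math.radians(t - p)) for t, p in zip(torsions, PREFERRED)
    )
    # A small coupling between neighbouring torsions so the bonds are not independent.
    energy += sum(
        0.2 * (1.0 + math.cos(math.radians(a - b)))
        for a, b in zip(torsions, torsions[1:])
    )
    return energy


def systematic_search(allowed, energy):
    """Evaluate every combination of allowed torsions and keep the lowest energy."""
    best, best_energy, evaluated = None, math.inf, 0
    for torsions in itertools.product(*allowed):
        evaluated += 1
        e = energy(torsions)
        if e < best_energy:
            best, best_energy = torsions, e
    return best, best_energy, evaluated


def random_search(allowed, energy, n_trials, rng):
    """Try n_trials random settings drawn from the allowed values."""
    best, best_energy = None, math.inf
    for _ in range(n_trials):
        torsions = tuple(rng.choice(values) for values in allowed)
        e = energy(torsions)
        if e < best_energy:
            best, best_energy = torsions, e
    return best, best_energy


if __name__ == "__main__":
    print(systematic_search(ALLOWED, toy_energy))
    print(random_search(ALLOWED, toy_energy, 10, random.Random(0)))
```

The systematic search visits 3 x 4 x 3 = 36 combinations here, and the count grows as the product
of the number of allowed values per bond, which is why the text notes that it may not be feasible
for molecules with many rotatable bonds. The random search evaluates a fixed number of conformers
and so can never return a lower energy than the systematic result on the same grid.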

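Returning to the Fingerprints and Fast Searching section, the substructure screen and the Tanimoto
coefficient can be illustrated with plain bit operations. The sketch below is an assumption-laden
toy: the fragment lists are written by hand rather than enumerated from a molecule, the hash is
SHA-256 rather than whatever hash FP2 uses, and the function names are invented for illustration.
Only the 1024-bit length is taken from the text.

```python
import hashlib

N_BITS = 1024  # bit-string length quoted in the text for FP2


def fragment_bit(fragment):
    """Map a fragment string to one of N_BITS positions (illustrative hash only)."""
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_BITS


def fingerprint(fragments):
    """Fold a collection of fragments into a bit string stored as a Python int."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << fragment_bit(fragment)
    return bits


def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def popcount(bits):
    """Number of bits set."""
    return bin(bits).count("1")


def tanimoto(fp_a, fp_b):
    """Bits in common divided by bits in the union; 0.0 if both are empty."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


# Hand-written, hypothetical linear fragments (single-atom C omitted, as in the text).
ETHANOL = {"O", "CC", "CO", "CCO"}
PROPANOL = ETHANOL | {"CCC", "CCCO"}

if __name__ == "__main__":
    query, target = fingerprint(ETHANOL), fingerprint(PROPANOL)
    print(passes_screen(query, target))
    print(tanimoto(query, target))
```

Because different fragments can hash to the same bit, a screen of this kind can only rule
molecules out: a target that passes still has to be confirmed by a full substructure match.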
Implementation
Technical Details
Open Babel is implemented in standards-compliant C++. This ensures support for a wide variety of
C++ compilers (MSVC, GCC, Intel Compiler, MinGW, Clang), operating systems (Windows, Mac OS X,
Linux, BSD, Windows/Cygwin) and platforms (32-bit, 64-bit). Since version 2.3, it is compiled
using the CMake build system [37,38]. This is an open-source cross-platform build system with
advanced features for dependency analysis. The build system has an associated unit test framework
CTest, which allows nightly builds to be compiled and tested automatically with the results
collated and displayed on a centralized dashboard [39].
  To simplify installation Open Babel has as few external dependencies as possible. Where such
dependencies exist, they are optional. For example, if the XML development libraries are not
available, Open Babel will still compile successfully but none of the XML formats (such as
Chemical Markup Language, CML) will be available. Similarly, if the Eigen matrix and linear
algebra library is not found, any classes that require fast matrix manipulation (such as OBAlign,
which performs least squares alignment) will not be compiled.
  While the majority of the Open Babel library is written in C++, bindings have been developed for
a range of other programming languages, including Java and the .NET platform, as well as the
so-called “dynamic” scripting languages Perl, Python, and Ruby. These are automatically generated
from the C++ header files using the SWIG tool. As described previously [40], in the case of Python
an additional module is provided named Pybel that simplifies access to the C++ bindings. These
interfaces facilitate development of web-enabled chemistry applications, as well as rapid
development and prototyping.

Code Architecture
The Open Babel codebase has a modular design as shown in Figure 2. The goal of this design is
threefold:

    1. To separate the chemistry, the conversion process and the user interfaces reducing, as far
    as possible, the dependency of one upon another.
    2. To put all of the code for each chemical format in one place (usually a single file) and
    make the addition of new formats simple.
    3. To allow the format conversion of not just molecules, but also any other chemical objects,
    such as reactions.

 Figure 2 Architecture of the Open Babel codebase.

  The code base can be considered as consisting of the following modules (Figure 2):

    • The Chemical Core, which contains OBMol etc. and has all of the chemical structure
    description and manipulation. This is the heart of the application and its API can be used as
    a chemical toolbox. It has no input/output capabilities.
    • The Formats, which read and write to files of different types. These classes are derived
    from a common base class, OBFormat, which is in the Conversion Control module. They also make
    use of the chemical routines in the Chemical Core module. Each format file contains a global
    object of the format class. When the format is loaded the class constructor registers the
    presence of the class with OBConversion. This means that the formats are plugins - new formats
    can be added without changing any framework code.
    • Common Formats include OBMoleculeFormat and XMLBaseFormat from which most other formats
    (like Format A and Format B in the diagram) are derived. Independent formats like Format C are
    also possible.
    • The Conversion Control, which also keeps track of the available formats, the conversion
    options and the input and output streams. It can be compiled without reference to any other
    parts of the program. In particular, it knows nothing of the Chemical Core: mol.h is not
    included.
    • The User Interface, which may be a command line application, a Graphical User Interface
    (GUI), or may be part of another program that uses Open Babel’s input and output facilities.
    This depends only








    on the Conversion Control module (obconversion.h is included), but not on the Chemical Core or on
    any of the Formats.
    • The Fingerprint API, as well as being usable in external programs, is employed by the fastsearch
    and fingerprint formats.
    • The Fingerprints, which are bit arrays that describe an object and which facilitate fast
    searching. They are also built as plugins, registering themselves with their base class
    OBFingerprint, which is in the Fingerprint API.
    • Other features such as Forcefields, Partial Charge Models and Chemical Descriptors, although not
    shown in the diagram, are handled similarly to Fingerprints.
    • The Error Handling can be used throughout the program to log and display errors and warnings.

Extensible Interface
The utility of software libraries such as Open Babel depends on the ability of the design to be
extended over time to support new functionality. To facilitate this, Open Babel implements a plugin
interface for file formats, fingerprints, charge models, descriptors, “operators” and molecular
mechanics force fields. This ensures a clean separation of the implementation of a particular plugin
from the core Open Babel library code, and makes it easy for a new plugin (e.g. a new file format) to
be contributed; all that is needed is a single C++ file and a trivial change to one of the build
files. The operator plugins provide a very general mechanism for operating on a molecule (e.g. energy
minimization or 3D coordinate generation) or on a list of molecules (e.g. filtering or sorting) after
reading but before writing.
  Plugins are dynamically loaded at runtime. This decreases the overall disk and memory footprint of
Open Babel, allowing external developers to choose the particular functionality needed for their
application and ignore other, less relevant features. It also allows a third party to distribute
plugins separately from the Open Babel distribution to provide additional functionality.

Open-Source License and Open Development
Open Babel is open-source software, which offers end users and third-party developers a range of
additional rights not granted by proprietary chemistry software. Open-source software, at its most
basic level, grants users the rights to study how their software works, to adapt it for any purpose
or otherwise modify it, and to share the software and their modifications with others. In this sense,
Open Source functions in similar ways to the processes of open peer review, publication, and citation
in science. The rights granted by open source licenses largely coincide with the norms of scientific
ethics to enable verifiability, repeatability, and building on previous results and theories.
  Beyond these rights, Open Babel (like most other open-source projects) offers open development:
all development occurs in public forums and with public code repositories. This results in greater
input from the community, as any user can easily submit bug reports or feature suggestions, get
involved in discussions on the future direction of Open Babel or even become a developer
him/herself. In practice, the number of active contributors has increased over time through this
level of open, public development (Figure 3). Moreover, it means that the development of the code is
completely transparent and the quality of the software is available for public scrutiny. Indeed,
since its inception, over 658 bugs have been submitted to the public tracker and fixed [41].

 Figure 3 Number of contributors over time. Note that this graph only includes developers who
 directly committed code to the Open Babel source code repository, and does not include patches
 provided by users.

Validation and Testing
Open Babel includes an extensive test suite comprising 60 different test programs, each with tens to
hundreds of tests. In early 2010, a nightly build infrastructure and dashboard were put in place with
support from Kitware, Inc. This has greatly improved code quality by catching regressions, and also
ensures that the code compiles cleanly on all platforms and compilers supported by Open Babel. Some
examples of tests that are run each night are:

    (1) The MMFF94 forcefield code is tested against the MMFF94 validation suite.



J. Cheminf. 2011, 3, 33.




    (2) The OBAlign class, which was developed using Test-Driven Development (TDD) methodology, is
    run against its test suite.
    (3) Handling of symmetry is validated by converting several test cases between SMILES, 2D and 3D
    SDF, and InChI (there are also several test programs with unit tests for the individual stereo
    classes in the API).
    (4) The SMARTS parser is tested using over 250 valid and invalid SMARTS patterns, and the SMARTS
    matcher is tested using 125 basic SMARTS patterns.
    (5) The LSSR (Least Set of Smallest Rings) code is tested for invariance against changing the
    atom order for a series of polycyclic molecules.

  Recently the development team has placed a major focus on increasing the robustness of file format
translation, particularly in relation to the commonly used SMILES and MDL Molfile formats.
Translating between these formats requires accurate stereochemistry perception, inference of implicit
hydrogens, and kekulization of delocalized systems. While it is difficult to ensure that any complex
piece of code is free of bugs, and Open Babel is no exception, validation procedures can be carried
out to assess the current level of performance and to find additional test cases that expose bugs.
The following procedure was used to guide the rewriting of stereochemistry code in Open Babel, a
project that began in early 2009. Starting with a dataset of 18,084 3D structures from PubChem3D as
an SDF file, we compared the result of (a) conversion to SMILES, followed by conversion of that to
Canonical SMILES, with (b) conversion directly to Canonical SMILES. This procedure can be used to
flush out errors in reading the original SDF file and in reading/writing SMILES (either due to
stereochemistry errors or kekulization problems), and is also a test (to some extent) of the
canonicalization code. At the time of starting this work (March 2009), 1424 errors were found (8%);
by Oct 2009, combined work on stereochemistry, kekulization and canonicalization had reduced this to
190 (~1%), and continued improvements have reduced the number of errors down to two (shown in
Figure 4) for Open Babel 2.3.1 (~0.01%). The first failure is due to a kekulization error in a
polycyclic aromatic molecule incorporating heteroatoms: (a) gave
c1ccc2c(c1)c1[nH][nH]c3c4c1c(c2)ccc4cc1c3cccc1 while (b) gave
c1ccc2c(c1)c1nnc3c4c1c(c2)ccc4cc1c3cccc1. This error led to confusion over whether or not the
aromatic nitrogens have hydrogens attached (they do not). The second failure involves confusion over
the canonical stereochemistry at a bridgehead carbon: (a) gave C1CN2[C@@H](C1)CCC2 while (b) gave
C1CN2[C@H](C1)CCC2. This is actually a meso compound and so both SMILES strings are correct and
represent the same molecule. However, the canonicalization algorithm should have chosen one
stereochemistry or the other for the canonical representation.

 Figure 4 The two failures found in the validation test for reading/writing SMILES.

  Another area of focus was the canonicalization algorithm, which can be used to generate canonical
SMILES as well as other formats. The algorithm can be tested by ensuring that the same canonical
SMILES string is obtained even when the order of atoms in a molecule is changed (while retaining the
same connection table). The test stresses all areas of the library, including aromaticity perception,
kekulization, stereochemistry, and canonicalization. The development of the canonicalization code in
Open Babel was guided by applying this test to the 5,151,179 molecules in the eMolecules catalogue
(dated 2011-01-02) with 10 random shuffles of the atom order. At the time of the Open Babel 2.2.3
release, there were 24,404 failures of the canonicalization algorithm; this has now been reduced to
only four (shown in Figure 5, < 0.001%). The Open Babel nightly test suite ensures that this test
passes for a number of problematic molecules. Although the canonicalization algorithm is still not
perfect, we believe that the current level of performance (99.99992% success on the eMolecules
catalogue) is acceptable for general use and with time we intend to improve performance further.
  Given that the error rate for canonicalization and handling of stereochemistry is now quite low,
the next area of focus for the Open Babel development team is to improve the handling of implicit
valence for “unusual atoms.” This is particularly important for organometallic species and inorganic
complexes.

Using Open Babel
Applications
The Open Babel package is composed of a set of user applications as well as a programming library.
The main command line application provided is obabel (a small upgrade on the earlier babel), which
facilitates file format conversion, filtering (by SMARTS, title, descriptor value, or property
field), 3D or 2D structure generation,







 Figure 5 The four failures found in the validation test for canonicalization.
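
The atom-order shuffle test behind Figure 5 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Open Babel's test code: a molecule is represented as a plain list of element symbols plus a list of bond triples, and `canonicalize` is a hypothetical callable standing in for whatever toolkit routine is being tested. The two toy functions at the end are not real canonicalizers; they exist only to exercise the harness.

```python
import random


def shuffle_atoms(atoms, bonds, rng):
    """Renumber the atoms at random while keeping the same connection table.

    atoms is a list of element symbols; bonds is a list of
    (atom_index, atom_index, bond_order) triples.
    """
    old_indices = list(range(len(atoms)))
    rng.shuffle(old_indices)  # old_indices[new] is the old index of atom `new`
    new_index = {old: new for new, old in enumerate(old_indices)}
    new_atoms = [atoms[old] for old in old_indices]
    new_bonds = [(new_index[i], new_index[j], order) for i, j, order in bonds]
    return new_atoms, new_bonds


def count_failures(molecules, canonicalize, n_shuffles=10, seed=0):
    """Count molecules whose canonical string changes under atom reordering."""
    rng = random.Random(seed)
    failures = 0
    for atoms, bonds in molecules:
        reference = canonicalize(atoms, bonds)
        for _ in range(n_shuffles):
            shuffled_atoms, shuffled_bonds = shuffle_atoms(atoms, bonds, rng)
            if canonicalize(shuffled_atoms, shuffled_bonds) != reference:
                failures += 1  # one mismatch is enough to flag the molecule
                break
    return failures


# Toy stand-ins (hypothetical, NOT real canonicalizers) to exercise the harness.
def order_dependent(atoms, bonds):
    """Deliberately broken: the result depends on the atom numbering."""
    return "".join(atoms)


def order_independent(atoms, bonds):
    """Invariant by construction: uses only sorted atom and bond labels."""
    bond_labels = sorted(
        "-".join(sorted((atoms[i], atoms[j]))) + str(order) for i, j, order in bonds
    )
    return "|".join(sorted(atoms) + bond_labels)


# Heavy atoms of acetamide, CC(=O)N, as (symbols, bonds).
acetamide = (["C", "C", "O", "N"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
```

The success figure quoted in the text follows directly from the failure count: 4 failures in 5,151,179 molecules gives 1 - 4/5151179, or 99.99992%.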



conversion of hydrogens from implicit to explicit (and vice versa), and removal of small fragments
or of duplicate structures. A number of features are provided to handle multi-molecule file formats
(such as SDF or MOL2) and to use or manipulate the information in property fields and molecule
titles. Here is an example of using obabel to convert from SDF format to SMILES:
  obabel inputmols.sdf -O outputmols.smi
  A more complicated use would be to extract all molecules in an SDF file whose titles start with
“active”:
  obabel inputmols.sdf -aT -o copy -O outputmols.sdf --filter "title='active*'"
  The copy format specified by “-o copy” is a utility format that copies the exact contents of the
input file (for the filtered molecules) directly to the output, without perception or
interpretation. The “-aT” indicates that only the title of the input SDF file should be read; full
chemical perception is not required.
  The Open Babel graphical user interface (GUI) provides the same functionality. Figure 6 is a
screenshot of the GUI carrying out the same filtering operation described in the obabel example
above. The left panel deals with setting up the input file, the right panel handles the output and
the central panel is for setting conversion options. Depending on whether a particular option
requires a parameter, the available options are displayed either as check boxes or as text entry
boxes. These interface elements are generated dynamically, directly from the text description and
help text provided by each format plugin.

Programming Library
The Open Babel library allows users to write chemistry applications without worrying about the
low-level details of handling chemical information, such as how to read or write a particular file
format, or how to use SMARTS for substructure searching. Instead, the user can focus on the
scientific problem at hand, or on creating a more easy-to-use interface (e.g. a GUI) to some of Open
Babel’s functionality. The Open Babel API (Application Programming Interface) is the set of classes,
methods and variables provided by Open Babel to the user for use in programs. Documentation on the
complete API (generated using Doxygen [42]) is available from the Open Babel website [43], or can be
generated from the source code.
  The functionality provided by the Open Babel library is relied upon by many users and by several
other software projects, with the result that introducing changes to the API would cause existing
software to break. For this reason, Open Babel strives to maintain API stability over long periods
of time, so that existing software will continue to work despite the release of new Open Babel
versions with additional features, file formats and bug fixes. Open Babel uses a version numbering
system that indicates how the API has changed with every release:

    • Bug fix releases (e.g. 2.0.0 versus 2.0.1) do not change the API at all
    • Minor version releases (e.g. 2.0 versus 2.1) will add to the API, but will otherwise be
    backwards-compatible
    • Major version releases (e.g. 2 versus 3) are not backwards-compatible, and have changes to the
    API (including removal of deprecated classes and functions)

  Figure 7 shows an example C++ program that uses the two main classes OBConversion and OBMol to
print out the molecular weight of all of the molecules in an SDF file. This could be used, for
example, to investigate differences in the molecular weight distribution between two databases. The
same program is shown in Figure 8 but implemented using the Python bindings.

Examples of Use
Open Babel has already been referenced over 400 times for various uses. The most common use of Open
Babel is through the obabel command line application (or the corresponding graphical user interface)
for the interconversion of chemical file formats. Such conversions may also involve the calculation
or inference of additional molecular information or application of a filter. Some








 Figure 6 Screenshot of the Open Babel GUI. In the screenshot, the Open Babel GUI is running on Bio-Linux 6.0, an Ubuntu derivative.
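
The version numbering rules stated earlier (bug fix releases leave the API unchanged, minor releases only add to it, major releases may break it) can be turned into a small compatibility check. The sketch below is an illustration of those rules only; the function names are hypothetical and nothing like this is claimed to ship with Open Babel. It assumes version strings of the form "major.minor.bugfix".

```python
def parse_version(text):
    """Split a 'major.minor.bugfix' string into a tuple of three integers."""
    major, minor, bugfix = (int(part) for part in text.split("."))
    return major, minor, bugfix


def api_compatible(built_against, installed):
    """Return True if code written for `built_against` should work with `installed`.

    Major releases may remove API, so the major numbers must match.
    Minor releases only add API, so the installed minor number must be at
    least the one the code was written for. Bug fix numbers are ignored
    because bug fix releases do not change the API.
    """
    built_major, built_minor, _ = parse_version(built_against)
    have_major, have_minor, _ = parse_version(installed)
    if have_major != built_major:
        return False
    return have_minor >= built_minor
```

For example, code written against 2.0.2 is expected to keep working with 2.1.0, whereas code written against 2.1.0 may use API that 2.0.2 does not yet provide.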


published examples of these include the following:

    • interconversion of chemical file formats or representations [44-47]
    • addition of hydrogens [48-50]
    • generation of 3D molecular structures [51-53]
    • calculation of partial charges [54,55]
    • generation of molecular fingerprints [56-59]
    • removal of duplicate molecules from a dataset [60]
    • calculation of MOL2 atom types [61]

  An interesting example that shows how a particular chemical representation may be used to
facilitate a scientific study is the crystallographic study of Fábián and Brock, who used Open Babel
to generate InChI strings for molecules in the Cambridge Structural Database [62]. Exploiting the
fact that InChIs of enantiomers are identical except at the enantiomer sublayer (“/m0”




 Figure 7 Example C++ program that uses the Open Babel library. The program prints out the molecular
 weight of each molecule in the SDF file “dataset.sdf”.

 Figure 8 Example Python program that uses the Open Babel library. The program prints out the
 molecular weight of each molecule in the SDF file “dataset.sdf”.
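
The listings for Figures 7 and 8 did not survive as text. As a stand-in, here is a minimal Python sketch of such a program built on the two classes the article names, OBConversion and OBMol. It is not the authors' listing. The import is an assumption about the installed version: recent Open Babel releases expose the bindings as `from openbabel import openbabel`, older ones as a top-level `openbabel` module, so both are tried. The helper name `molecular_weights` is hypothetical.

```python
def molecular_weights(sdf_path):
    """Yield (title, molecular weight) for each molecule in an SDF file."""
    try:
        from openbabel import openbabel as ob  # newer package layout (assumed)
    except ImportError:
        import openbabel as ob  # older top-level module layout

    conversion = ob.OBConversion()
    conversion.SetInFormat("sdf")

    mol = ob.OBMol()
    # ReadFile reads the first molecule; Read continues with the same file.
    more_to_read = conversion.ReadFile(mol, sdf_path)
    while more_to_read:
        yield mol.GetTitle(), mol.GetMolWt()
        more_to_read = conversion.Read(mol)


# Usage, given an SDF file named as in the figure captions:
#
#     for title, weight in molecular_weights("dataset.sdf"):
#         print(title, weight)
```

Writing the loop as a generator keeps memory use flat for large files, which suits the stated use case of comparing molecular weight distributions between two databases.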








Table 1 Software applications and libraries that use Open Babel
Name                   Description                                                                       Reference                 Web page
Avogadro               GUI for molecular modelling and computational chemistry                           G. Hutchison, M. Hanwell  http://avogadro.openmolecules.net/
cclib                  Parse computational chemistry output files                                        [72]                      http://cclib.sf.net/
CCP1GUI                GUI for computational chemistry                                                   Jens Thomas               http://www.cse.scitech.ac.uk/ccg/software/ccp1gui
ChemAzTech             Manage a chemical laboratory database                                             Rémy Dernat               http://chemaztech.sf.net/
ChemSpotlight          Chemistry file indexer for MacOSX                                                 G. Hutchison              http://chemspotlight.openmolecules.net/
ChemT                  GUI for generating combinatorial libraries                                        Rui Abreu                 http://www.esa.ipb.pt/~ruiabreu/chemt
ChemTool               2D molecular drawing                                                              [73]                      http://ruby.chemie.uni-freiburg.de/~martin/chemtool
CMDF                   Library for handling and preparing multi-scale multi-paradigm simulations         [74]                      http://web.mit.edu/mbuehler/www/research/CMDF/CMDF.htm
Confab                 Systematically generate conformers                                                [36]                      http://confab.googlecode.com/
DockoMatic             Automate the preparation and analysis of AutoDock runs                            [75]                      http://sf.net/projects/dockomatic/
DOVIS 2.0              Automate the preparation and analysis of AutoDock runs                            [76]                      http://www.bhsai.org/dovis.html
FAF-Drugs2             ADMET filtering of molecular datasets                                             [77]                      http://www.mti.univ-paris-diderot.fr/fr/downloads.html
FMiner2                Large-scale chemical graph mining based on backbone refinement classes            [78,79]                   http://www.maunz.de/wordpress/bbrc
Ghemical               GUI for computational chemistry                                                   Tommi Hassinen            http://www.uku.fi/~thassine/projects/ghemical
Gnome Chemistry Utils  2D chemical editor, 3D viewer, chemical calculator and periodic table for Linux   Jean Bréfort              http://gchemutils.nongnu.org/
iBabel                 MacOSX interface to Open Babel and other Open chemistry tools                     Chris Swain               http://homepage.mac.com/swain/Sites/Macinchem/page65/ibabel3.html
Kalzium                GUI showing information on the periodic table of the elements                     Carsten Niehaus           http://edu.kde.org/kalzium/
Lazar                  Lazy Structure-Activity Relationships for toxicity prediction                     [80]                      http://www.in-silico.de/software/
Molekel                GUI for computational chemistry                                                   Ugo Varetto               http://molekel.cscs.ch/
molsKetch              2D chemical editor                                                                Harm van Eersel           http://molsketch.sf.net/
MyChem                 Chemistry extension to the MySQL database                                         J. Pansanel               http://mychem.sf.net/
NanoEngineer-1         Computer-aided design for the nanoscale                                           Nanorex, Inc.             http://nanoengineer-1.net/
NanoHive-1             Simulator for the study, experimentation, and development of nanotech entities    Brian Helfrich            http://www.nanohive-1.org/
OpenMD                 Open Source molecular dynamics engine                                             [81]                      http://openmd.net/
Open3DQSAR             High-throughput chemometric analysis of molecular interaction fields              [82,83]                   http://www.open3dqsar.org/
OSRA                   Extracts chemical structures from images                                          [84]                      http://osra.sf.net/
PgChem                 Chemistry extension to the PostgreSQL database                                    Ernst-Georg Schmidt       http://pgfoundry.org/projects/pgchem
Pharao                 Pharmacophore discovery and searching                                             Silicos NV                http://www.silicos.be/
Pharmer                Pharmacophore searching                                                           [85]                      http://smoothdock.ccbb.pitt.edu/pharmer
Piramid                Shape-based alignment of molecules                                                Silicos NV                http://www.silicos.be/
PyADF                  Library for handling and preparing quantum mechanical multi-scale simulations     [86]                      http://www.ipc.kit.edu/cfn-ysg/158.php
PyRx                   GUI for virtual screening with protein-ligand docking                             Sargis Dallakyan          http://pyrx.scripps.edu/
QMForge                GUI for analysing results of quantum chemistry calculations                       [72]                      http://qmforge.sf.net/
RMG                    Reaction Mechanism Generator                                                      [87]                      http://rmg.sf.net/
Sci3D                  Interactive visualization of 3D models of scientific data, such as                T.J. O’Donnell            http://sci3d.sf.net/
                       molecular structures and surfaces
Sieve                  Filter molecules from datasets                                                    Silicos NV                http://www.silicos.be/
SMIREP                 Generation of fragment-based structure-activity relationships                     [88]                      http://www.karwath.org/systems/smirep.html
Stripper               Extract molecular scaffolds                                                       Silicos NV                http://www.silicos.be/
Toxtree                Toxic hazard estimation using decision trees                                      Ideaconsult Ltd.          http://toxtree.sf.net/
V_Sim                  Visualize atomic structures such as crystals and grain boundaries                 Damien Caliste            http://inac.cea.fr/L_Sim/V_Sim/index.en.html
WebBabel               Web application for file format conversion                                        T.J. O’Donnell            http://webbabel.sf.net/
XDrawChem              2D molecular editor                                                               Bryan Herger              http://xdrawchem.sf.net/
XtalOpt                Extension to Avogadro for crystal-structure prediction                            [89]                      http://xtalopt.openmolecules.net/
YASARA                 GUI for molecular graphics, modeling and simulation                               Elmar Krieger             http://www.yasara.org/
ZODIAC                 GUI for molecular modelling and docking                                           [90]                      http://www.zeden.org/


or “/m1”), they used the InChIs as part of a workflow to identify kryptoracemates (a class of
racemic crystals where the enantiomers are not related by space-group symmetry) in the database.
  To implement new methods, or access additional molecular information, it is necessary to use the
Open Babel library directly either from C++ or using one of the supported language bindings. Some
examples of published studies that have done this include the following:

    • Dehmer et al. implemented molecular complexity measures based on information theory [63].
    • Langham and Jain developed a model for chemical mutagenicity based on atom pair features [64].
    • Fontaine et al. implemented a method, anchor-GRIND, that uses an anchor point of a molecular
    scaffold to compare molecular interaction fields when different substituents are present [65].
    • Konyk et al. have developed a plugin for Open Babel that adds support for the Web Ontology
    Language (OWL) to allow automated reasoning about chemical structures [66].
    • Kogej et al. (AstraZeneca) implemented a 3-point pharmacophore fingerprint called TRUST [67].

Table 2 Web applications and databases that use Open Babel
Name                   Description                                                                       Reference                 Web page
ChemDB                 Database of small molecules                                                       [91]                      http://cdb.ics.uci.edu/
Cheméo                 Chemical structure and property search engine                                     Céondo Ltd                http://www.chemeo.com/
ChemMine Tools         Web application for analysing and clustering small molecules                      [92]                      http://chemmine.ucr.edu/
eMolecules             Chemical vendor search engine                                                     eMolecules.com            http://emolecules.com/
FragmentStore          Database for comparison of fragments found in metabolites, drugs                  [93]                      http://bioinf-applied.charite.de/fragment_store/
                       and toxic compounds
Frog2                  FRee Online druG 3D conformation generation                                       [94]                      http://bioserv.rpbs.univ-paris-diderot.fr/cgi-bin/Frog2
hBar Lab               Web application providing on-demand access to computer-aided chemistry            hBar Solutions ApS        https://www.hbar-lab.com/
IUPHAR-DB              Database of human drug targets and their ligands                                  [95]                      http://www.iuphar-db.org/
OpenCDLig              Web application for sharing resources about cyclodextrin/ligand complexes         [96]                      https://kdd.di.unito.it/casmedchem/
PSMDB                  Protein - Small-Molecule Database                                                 [97]                      http://compbio.cs.toronto.edu/psmdb/
SambVca                Web application for calculation of buried volume of organometallic ligands        [98]                      https://www.molnac.unisa.it/OMtools/sambvca.php
ScafBank               Database of molecular scaffolds                                                   [99]                      http://202.127.30.184:8080/scafbank.html
SMARTCyp               Web application for prediction of sites of cytochrome P450 mediated metabolism    [100]                     http://www.farma.ku.dk/smartcyp/
sMol Explorer          Web application for exploring small-molecule datasets                             [101]                     http://www3a.biotec.or.th/isl/index.php/smol-explorer
SuperImposé            Web application for structural similarity between ligands, binding                [102]                     http://farnsworth.charite.de/superimpose-web/
                       sites or proteins
SuperToxic             Database of toxic compounds                                                       [103]                     http://bioinformatics.charite.de/supertoxic/
SuperSite              Detailed information on, and comparisons of, protein-ligand binding sites         [104]                     http://bioinf-tomcat.charite.de/supersite/
SuperSweet             Database of natural and artificial sweeteners                                     [105]                     http://bioinf-applied.charite.de/sweet/
STITCH2                Chemical-protein interactions                                                     [106]                     http://stitch.embl.de/
VCCLAB                 Virtual Computational Chemistry Laboratory                                        [107]                     http://www.vcclab.org/
wwLigCSRre             Web application that performs ligand-based screening using 3D similarity          [108]                     http://bioserv.rpbs.univ-paris-diderot.fr/Help/wwLigCSRre.html
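
The InChI property exploited in the kryptoracemate study mentioned earlier (enantiomers share every layer except the “/m0” or “/m1” sublayer) lends itself to a simple string comparison. The sketch below is an illustration under stated assumptions, not the workflow of the cited study: it handles only simple InChIs in which each layer prefix occurs at most once (no isotopic or fixed-hydrogen layers), and the function names are hypothetical. The strings in the usage note are written for illustration and follow the layout of the InChIs for L- and D-alanine.

```python
def split_layers(inchi):
    """Return (header, layers) for a simple InChI string.

    header is the version and formula parts; layers maps each one-letter
    layer prefix (c, h, t, m, s, ...) to the text that follows it.
    Assumes each prefix occurs at most once.
    """
    parts = inchi.split("/")
    header = parts[:2]  # e.g. ["InChI=1S", "C3H7NO2"]
    layers = {part[0]: part[1:] for part in parts[2:]}
    return header, layers


def are_enantiomers(inchi_a, inchi_b):
    """True if the two InChIs differ only in the /m (enantiomer) sublayer."""
    header_a, layers_a = split_layers(inchi_a)
    header_b, layers_b = split_layers(inchi_b)
    m_a = layers_a.pop("m", None)
    m_b = layers_b.pop("m", None)
    if m_a is None or m_b is None:
        return False  # no enantiomer sublayer to compare
    return header_a == header_b and layers_a == layers_b and m_a != m_b


# Illustrative strings in the layout of the L- and D-alanine InChIs.
L_FORM = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_FORM = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
```

Here `are_enantiomers(L_FORM, D_FORM)` is true, while comparing a string with itself or with `NO_STEREO` is false.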






    • Many other examples exist [68-71].

  The vital role that a cheminformatics toolkit plays in the development of scientific resources is shown by Tables 1 and 2. Table 1 lists examples of stand-alone applications or programming libraries that rely on Open Babel, either calling the library directly or via one of the command-line executables. Table 2 contains examples of web applications and databases that either use Open Babel on the server or used Open Babel in the preparation of their data.

Conclusions
In November 2011, Open Babel will mark 10 years of existence as an independent project, and for the first time, we have discussed its development and features. As shown by more than 400 citations, it has become an essential tool for handling the myriad molecular file formats encountered in diverse branches of chemistry. While more work remains to be done, through validation processes such as those described above and the recent introduction of a nightly build and testing framework, we aim to improve the quality and robustness of the toolkit with each new release.
  Looking to the future, one of the goals of the project is to extend support to molecules that are currently not handled very well by existing cheminformatics toolkits. Toolkits typically focus on the types of molecules of principal importance to the pharmaceutical industry, namely stable organic molecules composed wholly of 2-center 2-electron covalent bonds. Molecules outside this set - such as radicals, organometallic and inorganic molecules, molecules with coordinate bonds or 3-center 2-electron bonds - are poorly supported in general. Future releases of Open Babel will provide substantially improved handling of such species. We also seek to improve the speed and coverage of important methods such as structure generation, kekulization and canonicalization.
  Open Babel is freely available from http://openbabel.org, and new community members are very welcome (users, developers, bug reporters, feature requesters). For information on how to use Open Babel, please see the documentation at http://openbabel.org/docs and the API documentation at http://openbabel.org/api.

Availability and Requirements
  Project Name: Open Babel
  Project home page: http://openbabel.org
  Operating system(s): Cross-platform
  Programming language: C++, bindings to Python, Perl, Ruby, Java, C#
  Other requirements (if compiling): CMake 2.4+
  License: GNU GPL v2
  Any restrictions to use by non-academics: None

Acknowledgements and Funding
We would like to thank all users and contributors to the Open Babel project over its history, including OpenEye Scientific Software Inc. for their initial OELib code. We also thank the Blue Obelisk Movement for ideas, comments on this manuscript, and support. We thank SourceForge for providing resources for issue tracking and managing releases, and Kitware for additional dashboard resources. NMOB is supported by a Health Research Board Career Development Fellowship (PD/2009/13).

Author details
1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, Co. Cork, Ireland. 2 Department of Chemistry, Technische Universität München, Garching D-85747, Germany. 3 eMolecules, Inc., 420 Stevens Ave #120, Solana Beach, CA 92075, USA. 4 Open Babel development team. 5 University of Pittsburgh, Department of Chemistry, 219 Parkman Avenue, Pittsburgh, PA 15217, USA.

Authors’ contributions
GRH is the lead developer of the Open Babel project. CAJ, CM, MB, NMOB, and TV are developers of Open Babel. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Received: 27 June 2011 Accepted: 7 October 2011 Published: 7 October 2011

References
1. Weininger D: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988, 28:31-36.
2. Murray-Rust P, Rzepa H: Chemical markup, XML, and the Worldwide Web. 1. Basic principles. J Chem Inf Comput Sci 1999, 39:928-942.
3. Murray-Rust P, Rzepa HS: Chemical Markup, XML and the World-Wide Web. 2. Information Objects and the CMLDOM. J Chem Inf Model 2001, 41:1113-1123.
4. Murray-Rust P, Rzepa H, Wright M: Development of chemical markup language (CML) as a system for handling complex chemical content. New J Chem 2001, 25:618-634.
5. Murray-Rust P, Rzepa H: Chemical Markup, XML, and the World Wide Web. 4. CML Schema. J Chem Inf Comput Sci 2003, 43:757-772.
6. Holliday GL, Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. J Chem Inf Model 2006, 46:145-157.
7. Daylight Theory: SMARTS [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html].
8. Fogel K: Producing Open Source Software: How to Run a Successful Free Software Project. O’Reilly Media, Inc., Sebastopol, CA; 2005.
9. Citations were generated by Google Scholar: [http://scholar.google.com/scholar?as_q=openbabel&num=10&as_occt=any&as_publication=&as_ylo=2001].
10. A selection of such projects is included below. The full list is available at: [http://openbabel.org/wiki/Related_Projects].
11. Open Babel: [http://openbabel.org/].
12. Open Babel Report Format: [http://openbabel.org/docs/2.3.0/FileFormats/Open_Babel_report_format.html].
13. Open Babel Fingerprint Format: [http://openbabel.org/docs/2.3.0/FileFormats/Fingerprint_format.html].
14. Open Babel Fastsearch Format: [http://openbabel.org/docs/2.3.0/FileFormats/Fastsearch_format.html].
15. MolPrint2D Format: [http://openbabel.org/docs/2.3.0/FileFormats/MolPrint2D_format.html].
16. Bender A, Mussa HY, Glen RC, Reiling S: Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naïve Bayesian Classifier. J Chem Inf Model 2004, 44:170-178.
17. MNA Format: [http://openbabel.org/docs/2.3.0/FileFormats/Multilevel_Neighborhoods_of_Atoms_(MNA).html].








18. Filimonov D, Poroikov V, Borodina Y, Gloriozova T: Chemical Similarity Assessment through Multilevel Neighborhoods of Atoms: Definition and Comparison with the Other Descriptors. J Chem Inf Model 1999, 39:666-670.
19. PDB Format v3.2: [http://www.wwpdb.org/documentation/format32/v3.2.html].
20. PDB: Cruft to Content: [http://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html].
21. Morgan HL: The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J Chem Docum 1965, 5:107-113.
22. Nauty: [http://cs.anu.edu.au/~bdm/nauty/].
23. McKay BD: Practical graph isomorphism. Congressus Numerantium 1981, 30:45-87.
24. Gakh A, Burnett M: Modular Chemical Descriptor Language (MCDL): Composition, connectivity, and supplementary modules. J Chem Inf Comput Sci 2001, 41:1494-1499.
25. Trepalin SV, Yarkov AV, Pletnev IV, Gakh AA: A Java Chemical Structure Editor Supporting the Modular Chemical Descriptor Language (MCDL). Molecules 2006, 11:219-231.
26. Gakh AA, Burnett MN, Trepalin SV, Yarkov AV: Modular Chemical Descriptor Language (MCDL): Stereochemical modules. J Cheminf 2011, 3:5.
27. Halgren T: Merck molecular force field. 1. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem 1996, 17:490-519.
28. Halgren T: Merck molecular force field. 2. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J Comput Chem 1996, 17:520-552.
29. Halgren T: Merck molecular force field. 3. Molecular geometries and vibrational frequencies for MMFF94. J Comput Chem 1996, 17:553-586.
30. Halgren T, Nachbar R: Merck molecular force field. 4. Conformational energies and geometries for MMFF94. J Comput Chem 1996, 17:587-615.
31. Halgren T: Merck molecular force field. 5. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J Comput Chem 1996, 17:616-641.
32. Andronico A, Randall A, Benz RW, Baldi P: Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. J Chem Inf Model 2011, 51:760-776.
33. Rappe A, Casewit C, Colwell K, Goddard W III, Skiff WM: UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 1992, 114:10024-10035.
34. Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA: Development and testing of a general amber force field. J Comput Chem 2004, 25:1157-1174.
35. Wang J, Wang W, Kollman PA, Case DA: Automatic atom type and bond type perception in molecular mechanical calculations. J Molec Graph Model 2006, 25:247-260.
36. O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR: Confab - Systematic generation of diverse low-energy conformers. J Cheminf 2011, 3:8.
37. CMake: [http://www.cmake.org/].
38. Martin K, Hoffman B: Mastering CMake: A Cross-Platform Build System. Kitware, Inc., Clifton Park, NY; 5 2010.
39. CDash Dashboard for Open Babel: [http://my.cdash.org/index.php?project=Open+Babel].
40. O’Boyle N, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
41. Open Babel Bug Tracker: [https://sourceforge.net/tracker/?limit=25&func=&group_id=40728&atid=428740&status=2].
42. Doxygen: [http://www.doxygen.org/].
43. Open Babel API: [http://openbabel.org/api].
44. Myers J, Allison T, Bittner S, Didier B, Frenklach M, Green W, Ho Y, Hewson J, Koegler W, Lansing C, et al: A collaborative informatics infrastructure for multi-scale science. Cluster Computing 2005, 8:243-253.
45. Lind P, Alm M: A Database-Centric Virtual Chemistry System. J Chem Inf Model 2006, 46:1034-1039.
46. Amini A, Shrimpton PJ, Muggleton SH, Sternberg MJE: A general approach for developing system-specific functions to score protein-ligand docked complexes using support vector inductive logic programming. Proteins: Struct, Funct, Bioinf 2007, 69:823-831.
47. Arbor S, Marshall GR: A virtual library of constrained cyclic tetrapeptides that mimics all four side-chain orientations for over half the reverse turns in the protein data bank. J Comput-Aided Mol Des 2008, 23:87-95.
48. Huang Z, Wong CF: A Mining Minima Approach to Exploring the Docking Pathways of p-Nitrocatechol Sulfate to YopH. Biophys J 2007, 93:4141-4150.
49. Hill AD, Reilly PJ: A Gibbs free energy correlation for automated docking of carbohydrates. J Comput Chem 2008, 29:1131-1141.
50. Armen RS, Chen J, Brooks CL III: An Evaluation of Explicit Receptor Flexibility in Molecular Docking Using Molecular Dynamics and Torsion Angle Molecular Dynamics. J Chem Theory Comp 2009, 5:2909-2923.
51. Liu L, Ma H, Yang N, Tang Y, Guo J, Tao W, Jaa Duan: A Series of Natural Flavonoids as Thrombin Inhibitors: Structure-activity relationships. Thromb Res 2010, 126:e365-e378.
52. Wallach I, Jaitly N, Lilien R: A Structure-Based Approach for Mapping Adverse Drug Reactions to the Perturbation of Underlying Biological Pathways. PLoS One 2010, 5:e12063.
53. Paila YD, Tiwari S, Sengupta D, Chattopadhyay A: Molecular modeling of the human serotonin1A receptor: role of membrane cholesterol in ligand binding of the receptor. Molecular BioSystems 2011, 7:224-234.
54. Melville JL, Hirst JD: TMACC: Interpretable Correlation Descriptors for Quantitative Structure-Activity Relationships. J Chem Inf Model 2007, 47:626-634.
55. Pencheva T, Lagorce D, Pajeva I, Villoutreix BO, Miteva MA: AMMOS: Automated Molecular Mechanics Optimization tool for in silico Screening. BMC Bioinformatics 2008, 9:438.
56. Schietgat L, Ramon J, Bruynooghe M: An Efficiently Computable Graph-Based Metric for the Classification of Small Molecules. Proceedings of the 11th International Conference on Discovery Science. Springer-Verlag Berlin, Heidelberg; 2008, 197-209.
57. Krier M, Hutter MC: Bioisosteric Similarity of Molecules Based on Structural Alignment and Observed Chemical Replacements in Drugs. J Chem Inf Model 2009, 49:1280-1297.
58. Wang X, Huan J, Smalter A, Lushington GH: Application of kernel functions for accurate similarity search in large chemical databases. BMC Bioinformatics 2010, 11:S8.
59. Cheng T, Li Q, Wang Y, Bryant SH: Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection. J Chem Inf Model 2011, 51:229-236.
60. Mihaleva VV, Verhoeven HA, de Vos RCH, Hall RD, van Ham RCHJ: Automated procedure for candidate compound selection in GC-MS metabolomics based on prediction of Kovats retention index. Bioinformatics 2009, 25:787-794.
61. Bas DC, Rogers DM, Jensen JH: Very fast prediction and rationalization of pKa values for protein-ligand complexes. Proteins: Struct, Funct, Bioinf 2008, 73:765-783.
62. Fabian L, Brock CP: A list of organic kryptoracemates. Acta Cryst 2010, B66:94-103.
63. Dehmer M, Barbarini N, Varmuza K, Graber A: A Large Scale Analysis of Information-Theoretic Network Complexity Measures Using Chemical Structures. PLoS One 2009, 4:e8057.
64. Langham JJ, Jain AN: Accurate and Interpretable Computational Modeling of Chemical Mutagenicity. J Chem Inf Model 2008, 48:1833-1839.
65. Fontaine F, Pastor M, Zamora I: Anchor-GRIND: Filling the gap between standard 3D QSAR and the GRid-INdependent Descriptors. J Med Chem 2005, 48:2687-2694.
66. Konyk M, De Leon A, Dumontier M: Chemical knowledge for the semantic web. Data Integration in the Life Sciences. Springer-Verlag Berlin, Heidelberg; 2008, 169-176.
67. Kogej T, Engkvist O, Blomberg N, Muresan S: Multifingerprint Based Similarity Searches for Targeted Class Compound Selection. J Chem Inf Model 2006, 46:1201-1213.
68. Reynès C, Host H, Camproux A-C, Laconde G, Leroux F, Mazars A, Deprez B, Fahraeus R, Villoutreix BO, Sperandio O: Designing Focused Chemical Libraries Enriched in Protein-Protein Interaction Inhibitors using Machine-Learning Methods. PLoS Computational Biology 2010, 6:e1000695.
69. Lagorce D, Pencheva T, Villoutreix BO, Miteva MA: DG-AMMOS: A New tool to generate 3D conformation of small molecules using Distance Geometry and Automated Molecular Mechanics Optimization for in silico Screening. BMC Chemical Biology 2009, 9:6.








70. Gómez MJ, Pazos F, Guijarro FJ, de Lorenzo V, Valencia A: The environmental fate of organic pollutants through the global microbial metabolism. Molecular Systems Biology 2007, 3:114.
71. Kazius J, Nijssen S, Kok J, Bäck T, IJzerman AP: Substructure Mining Using Elaborate Chemical Representation. J Chem Inf Model 2006, 46:597-605.
72. O’Boyle NM, Tenderholt AL, Langner KM: cclib: A library for package-independent computational chemistry algorithms. J Comput Chem 2008, 29:839-845.
73. Brüstle M: Chemtool - Moleküle zeichnen mit dem Pinguin. Nachrichten aus der Chemie 2001, 49:1310-1313.
74. Buehler M, Dodson J, van Duin A: The Computational Materials Design Facility (CMDF): A powerful framework for multi-paradigm multi-scale simulations. Materials Research Society symposium proceedings 2006, 894:LL3.8.
75. Bullock CW, Jacob RB, McDougal OM, Hampikian G, Andersen T: Dockomatic - automated ligand creation and docking. BMC Research Notes 2010, 3:289.
76. Jiang X, Kumar K, Hu X, Wallqvist A, Reifman J: DOVIS 2.0: an efficient and easy to use parallel virtual screening tool based on AutoDock 4.0. Chem Cent J 2008, 2:18.
77. Lagorce D, Sperandio O, Galons H, Miteva MA, Villoutreix BO: FAF-Drugs2: Free ADME/tox filtering tool to assist drug discovery and chemical biology projects. BMC Bioinformatics 2008, 9:396.
78. Maunz A, Helma C, Kramer S: Efficient mining for structurally diverse subgraph patterns in large molecular databases. Machine Learning 2010, 83:193-218.
79. Maunz A, Helma C, Kramer S: Large-scale graph mining using backbone refinement classes. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009). ACM Paris; 2009, 617-626.
80. Helma C: Lazy structure-activity relationships (lazar) for the prediction of rodent carcinogenicity and Salmonella mutagenicity. Mol Diversity 2006, 10:147-158.
81. Meineke MA, Vardeman CF, Lin T, Fennell CJ, Gezelter JD: OOPSE: an object-oriented parallel simulation engine for molecular dynamics. J Comput Chem 2005, 26:252-271.
82. Tosco P, Balle T: Brute-force pharmacophore assessment and scoring with Open3DQSAR. J Cheminf 2011, 3(Suppl 1):P39.
83. Tosco P, Balle T: Open3DQSAR: a new open-source software aimed at high-throughput chemometric analysis of molecular interaction fields. J Mol Model 2011, 17:201-208.
84. Filippov IV, Nicklaus MC: Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. J Chem Inf Model 2009, 49:740-743.
85. Koes DR, Camacho CJ: Pharmer: Efficient and Exact Pharmacophore Search. J Chem Inf Model 2011, 51:1307-1314.
86. Jacob CR, Beyhan SM, Bulo RE, Gomes ASP, Götz AW, Kiewisch K, Sikkema J, Visscher L: PyADF - A scripting framework for multiscale quantum chemistry. J Comput Chem 2011, 32:2328-2338.
87. Green WH, Allen JW, Ashcraft RW, Beran GJ, Class CA, Gao C, Goldsmith CF, Harper MR, Jalan A, Magoon GR, Matheu DM, Merchant SS, Mo JD, Petway S, Raman S, Sharma S, Song J, Van Geem KM, Wen J, West RH, Wong A, Wong H-W, Yelvington PE, Yu J: RMG - Reaction Mechanism Generator v3.3. 2011 [http://rmg.sourceforge.net/].
88. Karwath A, De Raedt L: SMIREP: Predicting Chemical Activity from SMILES. J Chem Inf Model 2006, 46:2432-2444.
89. Lonie DC, Zurek E: XTALOPT: An open-source evolutionary algorithm for crystal structure prediction. Comput Phys Commun 2011, 182:372-387.
90. Zonta N, Grimstead IJ, Avis NJ, Brancale A: Accessible haptic technology for drug design applications. J Mol Model 2008, 15:193-196.
91. Chen JH, Linstead E, Swamidass SJ, Wang D, Baldi P: ChemDB update full-text search and virtual chemical space. Bioinformatics 2007, 23:2348-2351.
92. Backman TWH, Cao Y, Girke T: ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic Acids Res 2011, 39:W486-W491.
93. Ahmed J, Worth CL, Thaben P, Matzig C, Blasse C, Dunkel M, Preissner R: FragmentStore - a comprehensive database of fragments linking metabolites, toxic molecules and drugs. Nucleic Acids Res 2010, 39:D1049-D1054.
94. Miteva MA, Guyon F, Tuffery P: Frog2: Efficient 3D conformation ensemble generator for small compounds. Nucleic Acids Res 2010, 38:W622-W627.
95. Sharman JL, Mpamhanga CP, Spedding M, Germain P, Staels B, Dacquet C, Laudet V, Harmar AJ, NC-IUPHAR: IUPHAR-DB: new receptors and tools for easy searching and visualization of pharmacological data. Nucleic Acids Res 2010, 39:D534-D538.
96. Esposito R, Ermondi G, Caron G: OpenCDLig: a free web application for sharing resources about cyclodextrin/ligand complexes. J Comput-Aided Mol Des 2009, 23:669-675.
97. Wallach I, Lilien R: The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 2009, 25:615-620.
98. Poater A, Cosenza B, Correa A, Giudice S, Ragone F, Scarano V, Cavallo L: SambVca: A Web Application for the Calculation of the Buried Volume of N-Heterocyclic Carbene Ligands. Eur J Inorg Chem 2009, 2009:1759-1766.
99. Yan B-b, Xue M-z, Xiong B, Liu K, Hu D-y, Shen J-k: ScafBank: a public comprehensive Scaffold database to support molecular hopping. Acta Pharmacologica Sinica 2009, 30:251-258.
100. Rydberg P, Gloriam DE, Olsen L: The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics 2010, 26:2988-2989.
101. Ingsriswang S, Pacharawongsakda E: sMOL Explorer: an open source, web-enabled database and exploration tool for Small MOLecules datasets. Bioinformatics 2007, 23:2498-2500.
102. Bauer RA, Bourne PE, Formella A, Frommel C, Gille C, Goede A, Guerler A, Hoppe A, Knapp EW, Poschel T, et al: Superimpose: a 3D structural superposition server. Nucleic Acids Res 2008, 36:W47-W54.
103. Schmidt U, Struck S, Gruening B, Hossbach J, Jaeger IS, Parol R, Lindequist U, Teuscher E, Preissner R: SuperToxic: a comprehensive database of toxic compounds. Nucleic Acids Res 2009, 37:D295-D299.
104. Bauer RA, Gunther S, Jansen D, Heeger C, Thaben PF, Preissner R: SuperSite: dictionary of metabolite and drug binding sites in proteins. Nucleic Acids Res 2009, 37:D195-D200.
105. Ahmed J, Preissner S, Dunkel M, Worth CL, Eckert A, Preissner R: SuperSweet - a resource on natural and artificial sweetening agents. Nucleic Acids Res 2010, 39:D377-D382.
106. Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen LJ, Beyer A, Bork P: STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res 2009, 38:D552-D556.
107. Tetko IV, Gasteiger J, Todeschini R, Mauri A, Livingstone D, Ertl P, Palyulin VA, Radchenko EV, Zefirov NS, Makarenko AS, et al: Virtual Computational Chemistry Laboratory - Design and Description. J Comput-Aided Mol Des 2005, 19:453-463.
108. Sperandio O, Petitjean M, Tuffery P: wwLigCSRre: a 3D ligand-based server for hit identification and optimization. Nucleic Acids Res 2009, 37:W504-W509.

  doi:10.1186/1758-2946-3-33
  Cite this article as: O’Boyle et al.: Open Babel: An open chemical toolbox. Journal of Cheminformatics 2011 3:33.




Part II

Enzyme reaction mechanisms




Vol. 21 no. 23 2005, pages 4315–4316
 BIOINFORMATICS APPLICATIONS NOTE                                                                                              doi:10.1093/bioinformatics/bti693

 Databases and ontologies

 MACiE: a database of enzyme reaction mechanisms
 Gemma L. Holliday1, Gail J. Bartlett2,†, Daniel E. Almonacid1, Noel M. O’Boyle1,
 Peter Murray-Rust1, Janet M. Thornton2 and John B. O. Mitchell1,*
 1 Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge,
 Lensfield Road, Cambridge, CB2 1EW, UK and 2 EMBL-EBI, Wellcome Trust Genome Campus,
 Hinxton, Cambridge, CB10 1SD, UK
 Received on July 21, 2005; revised on September 22, 2005; accepted on September 23, 2005
 Advance Access publication September 27, 2005




 ABSTRACT                                                                             DESIGN




                                                                                                                                                                          Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on October 22, 2011
 Summary: MACiE (mechanism, annotation and classification in                          The MACiE dataset evolved from that published in the Catalytic
 enzymes) is a publicly available web-based database, held in                         Site Atlas (CSA) (Bartlett et al., 2002; Porter et al., 2004), and each
 CMLReact (an XML application), that aims to help our understanding                   entry is selected so that it fulfils the following criteria:
 of the evolution of enzyme catalytic mechanisms and also to create a
 classification system which reflects the actual chemical mechanism                      (1) There is a 3D crystal structure of the enzyme deposited in the
 (catalytic steps) of an enzyme reaction, not only the overall reaction.                     Protein Databank (PDB) (Berman et al., 2000).
 Availability: http://www-mitchell.ch.cam.ac.uk/macie/                                   (2) There is a relatively well-understood mechanism available.
 Contact: jbom1@cam.ac.uk                                                                    Taken from the literature, these cover a variety of
                                                                                             methodologies, including chemical and biochemical studies,
 A great deal of knowledge about enzymes, including structures,                              quantum mechanical calculations and structual biology
 gene sequences, mechanisms, metabolic pathways and kinetic                                  reports.
 data, now exists. However, it is spread between many different                          (3) The enzyme is unique at the H level of the CATH
 databases and throughout the literature. Here we announce the                               classification—a hierarchical classification system of
 completion of the initial version of MACiE, a unique database of                            protein domain structures (Orengo et al., 1997)—unless
 the chemical mechanisms of enzymatic reactions.                                             there is a homologue with a significantly different chemical
    Web resources such as BRENDA (Schomburg et al., 2004),                                   mechanism.
 KEGG (Kanehisa et al., 2004) and the International Union of Bio-                        (4) Where there are a number of possible PDB codes available
 chemistry and Molecular Biology (IUBMB) Enzyme Nomenclature                                 the entry should be, if possible, a wild-type enzyme.
 website (IUBMB, 2005, http://www.chem.qmul.ac.uk/iubmb/
 enzyme/) contain descriptions of the overall reactions performed                        All MACiE enzymes are also contained in the Enzyme Commis-
 by enzymes, accompanied in some cases by a textual or graphical                      sion (EC) classification system (IUBMB, 2005, http://www.chem.
 description of the mechanism. MACiE is unique in combining                           qmul.ac.uk/iubmb/enzyme/), that is, they all have four number codes
 detailed stepwise mechanistic information (including 2D anima-                       describing their overall reaction. The first level (Class) describes
 tions), a wide coverage of both chemical space and the protein                       the basic reaction type. The second and third levels (subclass and
 structure universe, and the chemical intelligence of CMLReact                        sub-subclass, respectively) describe the reaction in further detail
 (Holliday,C.L., Murray-Rust,P., and Rzepa,H.S., 2005, manuscript                     and the final level (serial number) describes substrate specificity.
 submitted to J. Chem. Inf. Modeling). MACiE usefully complements                     For example, the β-lactamases (Fig. 1) are assigned the EC number
 both the mechanistic detail of the Structure–Function Linkage                        3.5.2.6, i.e. a hydrolase (3) acting on a C–N bond (5) in a cyclic
 Database (SFLD) for a small number of enzyme superfamilies                           amide (2) with a β-lactam as the substrate (6).
 (Pegg et al., 2005) and the wider coverage with less chemical                           In MACiE, the data centre on the catalytic steps involved in the
 detail provided by EzCatDB (Nagano, 2005) which also contains                        chemical mechanism as well as the overall reaction. Each entry
 a limited number of 3D animations.                                                   includes the following steps:
                                                                                          Enzyme name and EC number
 *
   To whom correspondence should be addressed.                                            PDB code and CATH codes of all domains in the enzyme
 †
  Present Address: Bioinformatics Support Service (Biochemistry Building),                Diagram and annotation of the overall reaction
 Centre for Bioinformatics, Division of Molecular Biosciences, Faculty of
 Life Sciences, Imperial College London, London, SW7 2AZ, UK                              Primary literature references


 © The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

 The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access
 version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University
 Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its
 entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org


Bioinformatics. 2005, 21, 4315-4316.
G.L.Holliday et al.



Fig. 1. The overall reaction for a β-lactamase.

     Diagram and annotation of all reaction steps, including:
      - The Ingold mechanism (Ingold, 1969)
      - Diagram and function of catalytic amino acid residues
      - Information on the reactive centres and bond changes
     Comments on the reaction (where applicable).

CONTENT
The criteria defined in the Design section initially produced a dataset of 100 entries. A single
EC number may cover a plurality of MACiE entries when different mechanisms bring about the same
overall chemical transformation, as with the two types of 3-dehydroquinate dehydratase, and thus
100 MACiE entries span only 96 EC numbers.
   The 100 enzymes in Version 1 of MACiE incorporate domains from 140 CATH homologous
superfamilies. MACiE currently covers 56 of the 174 EC sub-subclasses present in the PDB, thus,
we feel that we have a representative coverage of EC reaction space (comparative EC wheels are
available at URL http://www-mitchell.ch.cam.ac.uk/macie/ECCoverage/). We anticipate that all 158
sub-subclasses for which both structures and reliable mechanisms are available will be
represented in the forthcoming MACiE Version 2.

SOFTWARE
The data are initially entered in MDL’s ISIS/Base, a database package for chemical reactions,
validated by at least two people, and then converted into CMLReact using the Jumbo Toolkit
(Wakelin et al., 2005) to create an information and semantically rich database. At this stage we
add extra fields of information to the CMLReact version of MACiE that are unavailable in the ISIS
version, including the CATH code. Jumbo is a set of Java-based software which converts the MDL
file format produced from ISIS/Base into CMLReact. The MacieConverter section of Jumbo performs
the following functions:

     Integration of the files in the ISIS/Base version of MACiE
     Identification of reactant, product and spectator molecules
     Splitting of groups of molecules
     Automatic mapping of atoms within the reaction
     Checking for mass and charge conservation throughout the reaction (stoichiometry)
     Integration and checking of MACiE Dictionary entries.

Once the conversion process has been completed, a further tool in the Jumbo Toolkit, called
CMLSnap (Holliday et al., 2004), can be used to create an animation of the reaction. This
animation includes all of the atoms and bonds involved as well as the electron movements, which
are calculated automatically. It is expected that CML will become our primary method of data
entry and storage.

CURATION
The annotation process involves input and validation steps. Terms have been rigorously defined
either from the IUPAC Gold Book (McNaught et al., 1997), such as chemical terms like hydrolysis,
or from primary literature, such as mechanism, which is defined using Ingold’s terminology
(Ingold, 1969), originally put forward in the 1930s. All of the technical and scientific terms
used in MACiE are contained in the MACiE dictionary, which is available at the URL
http://www-mitchell.ch.cam.ac.uk/macie/glossary.html and is also available as a raw XML file.
   The entries online are accessed via an HTML look-up table and include all of the information
available in the database. The original ISIS/Base format file and the raw CML files can be
supplied.

FUTURE WORK
Future work includes expanding the dataset to include a representative set of EC numbers (at the
sub-subclass level), creating a search interface for MACiE and developing authoring tools for
MACiE in CML. Ongoing research focuses on the evolution of enzyme catalysis and the
classification of enzyme reaction mechanisms.

ACKNOWLEDGEMENTS
G.J.B. would like to thank Dr Jonathan Goodman for his invaluable help with organic chemistry
queries. We would also like to thank the EPSRC (G.L.H. and J.B.O.M.), the BBSRC (G.J.B. and
J.M.T., CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M., grant
BB/C51320X/1), the Chilean Government’s Ministerio de Planificación y Cooperación and Cambridge
Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science
Informatics.
Conflict of Interest: none declared.

REFERENCES
Bartlett,G.J. et al. (2002) Analysis of catalytic residues in enzyme active sites.
    J. Mol. Biol., 324, 105–121.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Holliday,G.L. et al. (2004) CMLSnap: animated reaction mechanisms. Internet J. Chem., 7,
    Article 4.
Ingold,C.K. (1969) Structure and Mechanism in Organic Chemistry. 2nd edn, Cornell University
    Press, Ithaca, NY, Chapters 5–15.
Kanehisa,M. et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32,
    D277–D280.
McNaught,A.D. and Wilkinson,A. (1997) International Union of Pure and Applied Chemistry
    Compendium of Chemical Terminology (‘‘The Gold Book’’). 2nd edn, ISBN 0-8-654-26848.
Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism Database. Nucleic Acids Res., 33,
    D407–D412.
Orengo,C.A. et al. (1997) CATH - a hierarchic classification of protein domain structures.
    Structure, 5, 1093–1108.
Pegg,S.C-H. et al. (2005) Representing structure-function relationships in mechanistically
    diverse enzyme superfamilies. Pac. Symp. Biocomput., 358–369.
Porter,C.T. et al. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues
    identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.
Schomburg,I. et al. (2004) BRENDA, the enzyme database: updates and major new developments.
    Nucleic Acids Res., 32, D431–D433.
Wakelin,J. et al. (2005) CML tools and information flow in atomic scale simulations.
    Mol. Simul., 31, 315–322.
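As an illustration of the mass and charge conservation check listed among the MacieConverter functions in the SOFTWARE section, the following minimal Python sketch compares element counts and net charge on the two sides of a reaction step. It is not the Jumbo implementation: the dictionary layout, the function names `totals` and `is_balanced`, and the toy amide hydrolysis are all assumptions made for the example.

```python
from collections import Counter


def totals(molecules):
    """Sum element counts and formal charges over a list of molecules.

    Each molecule is a dict with an "atoms" mapping (element -> count)
    and an integer "charge". This layout is hypothetical.
    """
    atoms = Counter()
    charge = 0
    for mol in molecules:
        atoms.update(mol["atoms"])
        charge += mol["charge"]
    return atoms, charge


def is_balanced(reactants, products):
    """True when both sides carry the same atoms and the same net charge."""
    return totals(reactants) == totals(products)


# Toy step: acetamide + water -> acetic acid + ammonia (amide hydrolysis).
acetamide = {"atoms": {"C": 2, "H": 5, "N": 1, "O": 1}, "charge": 0}
water = {"atoms": {"H": 2, "O": 1}, "charge": 0}
acetic_acid = {"atoms": {"C": 2, "H": 4, "O": 2}, "charge": 0}
ammonia = {"atoms": {"N": 1, "H": 3}, "charge": 0}

print(is_balanced([acetamide, water], [acetic_acid, ammonia]))
print(is_balanced([acetamide], [acetic_acid, ammonia]))  # water omitted
```

A check of this kind flags a step in which a reactant or product has been left out during data entry, which is presumably why it is paired with the identification of spectator molecules.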





Published online 1 November 2006                                     Nucleic Acids Research, 2007, Vol. 35, Database issue D515–D520
                                                                                                                  doi:10.1093/nar/gkl774


 MACiE (Mechanism, Annotation and Classification
 in Enzymes): novel tools for searching catalytic
 mechanisms
 Gemma L. Holliday*, Daniel E. Almonacid1, Gail J. Bartlett, Noel M. O’Boyle1,
 James W. Torrance, Peter Murray-Rust1, John B. O. Mitchell1 and Janet M. Thornton

 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 1Unilever Centre for
 Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road,
 Cambridge CB2 1EW, UK

 Received August 4, 2006; Revised September 18, 2006; Accepted October 1, 2006




                                                                                                                                                               Downloaded from http://nar.oxfordjournals.org/ by guest on October 22, 2011
 ABSTRACT
 MACiE (Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction
 mechanisms, and is publicly available as a web-based data resource. This paper presents the
 first release of a web-based search tool to explore enzyme reaction mechanisms in MACiE. We also
 present Version 2 of MACiE, which doubles the dataset available (from Version 1). MACiE can be
 accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/

 INTRODUCTION
 Enzymes are proteins that catalyse the repertoire of chemical reactions found in nature, and as
 such are vitally important molecules. What is so fascinating about these proteins is that they
 have a wonderful diversity and can carry out highly complex chemical conversions under
 physiological conditions and retain their stereospecificity and regiospecificity, unlike many
 organic chemical reactions. They range in size and can have molecular weights of several
 thousand to several million Daltons, and still they can catalyse reactions on molecules as small
 as carbon dioxide or nitrogen, or as large as a complete chromosome.
    Although enzymes are large molecules, the actual catalysis only takes place in a small
 cavity, the active site. It is here that a small number of amino acid residues contribute to
 catalytic function, and where the substrates bind. With the advent of structure determination
 methods for proteins and by using clever chemical/biochemical experimental design, scientists
 have been able to propose catalytic mechanisms for many enzymes. Although a great deal of
 knowledge exists for enzymes, including their structures, gene sequences, mechanisms, metabolic
 pathways and kinetic data, it tends to be spread between many different databases and throughout
 the literature. Most web resources relating to enzymes [such as BRENDA (1), KEGG (2), the IUBMB
 Enzyme Nomenclature website (http://www.chem.qmul.ac.uk/iubmb/enzyme/) (3) and IntEnz (4)] focus
 on the overall reaction, accompanied in some cases by a textual or graphical description of the
 mechanism. However, this does not allow for detailed in silico searching of the chemical steps
 which take place in the reaction. MACiE (5) combines detailed stepwise mechanistic information
 [including 2-D animations (6)], a wide coverage of both chemical space and the protein structure
 universe, and the chemical intelligence of the Chemical Markup Language for Reactions
 (CMLReact) (7). This usefully complements both the mechanistic detail of the Structure–Function
 Linkage Database (SFLD) for a small number of rather ‘promiscuous’ enzyme superfamilies (8) and
 the wider coverage with less chemical detail provided by EzCatDB (9), which also contains a
 limited number of 3D animations. Entries in MACiE are linked, where appropriate, to all of these
 related data resources.

 DATASET AND CONTENT
 The dataset for MACiE version 2 was devised to increase the enzyme reaction space coverage of
 MACiE while trying to keep structural homology to a minimum. Each entry added in the new version
 was selected so that it fulfils the following criteria:

   (i) The EC sub-subclass was not previously in MACiE.
  (ii) There is a three-dimensional crystal structure of the enzyme deposited in the Protein Data
       Bank (wwPDB) (10).
 (iii) There is a mechanism available from the primary literature which explains most of the
       observed experimental results.

 *To whom correspondence should be addressed. Tel: +44 1223 492535; Fax: +44 1223 494486; Email: gemma@ebi.ac.uk
 Present address:
 Gail J. Bartlett, Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK

 © 2006 The Author(s).
 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
 by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



Nucleic Acid Res. 2007, 35, D515-D520.




Figure 1. EC wheels showing the EC coverage of MACiE Version 2 (left), the complete EC space (centre) and the coverage of EC space in the PDB by unique
EC serial numbers (right).


(iv) The enzyme is unique at the H level of the CATH code (11), unless the homologue already in
     MACiE has a significantly different chemical mechanism.

   Using the above criteria MACiE was expanded from 100 entries in version 1 to a total of 202
entries, which span 199 EC numbers (version 1 spanned 96 EC numbers) and cover a total of 862
reaction steps. There are almost 4000 EC numbers defined, but the number of different reaction
mechanisms needed to bring about all these overall transformations is not clear. For example,
the serine protease family of proteins has many different substrates, but the mechanisms are
broadly similar. In contrast the β-lactamase enzymes, which have the same EC number, have four
completely different mechanisms. Within the EC code, the fourth digit usually defines the
substrate specificity, which can be very variable in large enzyme families, but the reaction
mechanisms for enzymes with the same first three digits are usually essentially the same. In
total there are 224 EC sub-subclasses, with only 181 having known structures (12). Of these
MACiE covers 158, i.e. 87%. However, there are probably many more mechanisms that are yet to be
defined or discovered.
   As can be seen from Figure 1, MACiE covers a good proportion of the EC reaction space, with an
average relative difference between the size of corresponding EC classes of 4%, with the
transferases having the largest difference. When the coverage with respect to EC code present in
the PDB is examined, it can be seen that MACiE again represents the coverage of enzymes with
known structures very well, with an average relative difference between the corresponding EC
classes in MACiE of 5%.
   All entries in MACiE contain overall reaction annotation including the information detailed
in Table 1. Each elementary reaction or step within an entry is fully annotated as is detailed
in Figure 2; this includes comments that have been added by the annotators. An extension of the
content from MACiE Version 1 is the addition of inferred return steps. These are explicitly
labelled as being inferred in the comment field and are necessary to return the enzyme to a
state where it is ready to undergo another round of catalysis.

Table 1. Overall reaction annotation content

Catalysis and reaction specific information          Non-catalysis specific information

Enzyme name (common IUPAB/JCBN name)                 PDB code
EC code                                              Non-catalytic domain CATH code
Catalytic residues involved                          Non-catalytic UniProt code
Cofactors involved                                   Species name (common and scientific)
Reactants and products                               Other database identifiers, e.g. EzCatDB, SFLD, etc.
Catalytic domain CATH code                           Literature references
Catalytic UniProt code
Bonds involved, formed, cleaved, changed in order
Reactive centres
Overall reaction comments

   There is sometimes more than one proposed mechanism that is consistent with the available
experimental data. In MACiE, we have attempted not only to choose the best supported mechanism,
but also where possible to annotate enzymes with reasonable alternative mechanisms.
Unfortunately, in the current release such annotations are only available as comments on the
stage or overall reaction, although future releases of MACiE will include full entries for these
alternatives.
   Further details of the annotation process and a glossary of terms used can be found on the
MACiE website (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/documentation/ and
http://www.ebi.ac.uk/thornton-srv/databases/MACiE/glossary.html, respectively).

DATABASE STRUCTURE
The challenge with MACiE has been to capture and usefully represent all the different catalytic
steps that occur during the course of an enzymatic reaction. These reactions may consist of any
number of steps, and in MACiE we have reactions ranging from 1 step to 16 steps. The
representation of these reactions has evolved from a flat file entered in a commercially
available chemical database program (ISIS/Base) to the highly structured and powerful CMLReact
(7), which is an application of XML (the eXtensible Markup Language). The final step in this
evolution has been the conversion of the CMLReact into the relational database format of MySQL.
   CMLReact has a hierarchical structure, facilitating its conversion into the relational
database format of MySQL. The conversion relies on the CML Schema and requires the MACiE entries
to be consistent with the Schema, which adds an internal consistency check into our authoring
process.
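The conversion of a hierarchical CML document into relational tables can be sketched in a few lines of Python. The sketch assumes one table per element name, one row per element, one column per attribute, and two extra columns that point at the row of the parent element. It uses the standard-library `sqlite3` module in place of MySQL so that it runs without a server, and the XML fragment, the function name `load_cml` and the column names `row_id`, `parent_table` and `parent_row` are invented for the example rather than taken from the MACiE schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Toy CML-like fragment; element and attribute names are illustrative only.
CML = """
<reaction id="r1">
  <reactantList>
    <molecule id="m1" title="substrate"/>
    <molecule id="m2" title="water"/>
  </reactantList>
  <productList>
    <molecule id="m3" title="product"/>
  </productList>
</reaction>
"""


def load_cml(conn, elem, parent_table=None, parent_row=None):
    """Store one element as a row in the table named after its tag, then recurse."""
    table = elem.tag
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(row_id INTEGER PRIMARY KEY, parent_table TEXT, parent_row INTEGER)"
    )
    # Add a column for any attribute this table has not seen before.
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    for attr in elem.attrib:
        if attr not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{attr}" TEXT')
    names = ["parent_table", "parent_row", *elem.attrib]
    values = [parent_table, parent_row, *elem.attrib.values()]
    columns = ", ".join(f'"{name}"' for name in names)
    marks = ", ".join("?" for _ in names)
    cursor = conn.execute(
        f'INSERT INTO "{table}" ({columns}) VALUES ({marks})', values
    )
    # Children record which row of which table is their parent,
    # so the tree can be rebuilt from the tables.
    for child in elem:
        load_cml(conn, child, table, cursor.lastrowid)


conn = sqlite3.connect(":memory:")
load_cml(conn, ET.fromstring(CML))
for row in conn.execute("SELECT id, title, parent_table, parent_row FROM molecule"):
    print(row)
```

Because every row records its parent table and parent row, a query can move between the tree view and the relational view without consulting the original XML.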






Figure 2. An example of the annotation found in a MACiE entry. Reaction shown corresponds to fructose-bisphosphate aldolase (entry 52).



Table 2. Searches available in MACiE

Basic                                Complex

MACiE entry identifier               Species name (overall annotation)
Current EC codes                     Overall reactants and products
Obsolete EC codes                    Reaction comments (overall reactions and steps)
Catalytic Domain CATH codes          Amino acid residues (up to six residues)
All CATH codes                       Step mechanisms and/or mechanism components (single and combinations of)
PDB code                             Chemical changes
Enzyme name                          Chemical changes with mechanism or mechanism components
Catalytic Domain UniProt Codes       Chemical changes with amino acid residues
All UniProt Codes                    Amino acid residues with mechanism or mechanism components
                                     Chemical changes with amino acid residues and mechanisms or mechanism components
                                     Alternative mechanisms

Figure 3. EC code search heuristics.

Each CML tag-type becomes a MySQL table; each tag becomes a row in that MySQL table; each
attribute of that tag corresponds to a column in the MySQL table. The tree structure of the CML
is preserved in the MySQL version; for each row of each table, there are columns specifying
which row of which other table corresponds to the row’s parent tag in the CML version.
   The CML version of MACiE, which is the official archive version, is available from the
website as individual entries, and the new website uses the relational version of MACiE to
perform the online analysis and searching.

DATABASE FEATURES
The original release of MACiE contained static images and annotation for the overall reaction
and each step associated with the mechanism; it also included an animated reaction mechanism for
approximately half the reactions then in MACiE. Links to various related resources, such as the
RCSB PDB (13), IUBMB nomenclature database, CATH, EzCatDB, PDBSum (14), BRENDA, the Catalytic
Site Atlas (15), KEGG and the Enzyme Structures Database, were also included. This new release
extends these links to include the Macromolecular Structures Database (MSD) (16), SFLD, UniProt
(17), and replaces the IUBMB nomenclature database links with links to IntEnz. The new features
in MACiE are detailed in the following sections.

Searching MACiE
There are two levels of search implemented in MACiE. The basic level searches are implemented
from the main page (http://www.ebi.ac.uk/thornton-srv/databases/MACiE) and are






Figure 4. Advanced EC search heuristics.




walk up the EC code tree until it finds a match, no matter at what level the search is entered.
Thus the search will always return a result. As the EC code of enzymes may change over time, a
search for obsolete EC codes has also been implemented, although this search will not always
return a result. However, it should be noted that the higher up the EC hierarchy the search has
gone, the less likely it is that the returned mechanism will be a match to the query. The
obsolete EC code search works in the same way as the current EC code search.
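The walk up the EC code tree can be illustrated with a short Python sketch. It assumes a plain dictionary from four-part EC numbers to entry labels; the function name `ec_walk_up`, the dictionary `TOY_ENTRIES` and the labels `entry_A` to `entry_C` are hypothetical and do not correspond to real MACiE identifiers. The query is relaxed one level at a time, from serial number to sub-subclass to subclass to class, until at least one entry shares the remaining prefix.

```python
def ec_walk_up(query, entries_by_ec):
    """Return (matched_prefix, entries), relaxing the EC code one level at a time."""
    parts = query.split(".")
    for depth in range(len(parts), 0, -1):
        prefix = parts[:depth]
        hits = sorted(
            entry
            for ec, entries in entries_by_ec.items()
            if ec.split(".")[:depth] == prefix
            for entry in entries
        )
        if hits:
            return ".".join(prefix), hits
    # Only reached when not even the top-level class is represented.
    return None, []


# Toy index from EC number to entry labels (labels are invented).
TOY_ENTRIES = {
    "3.5.2.6": ["entry_A"],
    "3.5.1.4": ["entry_B"],
    "4.1.2.13": ["entry_C"],
}

print(ec_walk_up("3.5.2.6", TOY_ENTRIES))  # exact match at the serial number
print(ec_walk_up("3.5.2.9", TOY_ENTRIES))  # falls back to the sub-subclass
print(ec_walk_up("3.5.4.1", TOY_ENTRIES))  # falls back to the subclass
```

Returning the matched prefix alongside the entries makes the caveat in the text explicit: the shorter the prefix, the less likely the returned mechanism is to match the query.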
Figure 5. PDB search heuristics.

   If no matches are found at the serial number level of the EC code, an advanced search option
will allow the user to search for a structural homologue of an enzyme with a given EC code,
which is shown in Figure 4 and described below. This advanced search option takes the entered EC
code and finds the PDB codes of all of the matches to that EC code in the Catalytic Site Atlas
(CSA). A homology search is then performed on those PDB codes for a match in MACiE. This
homology search is described in more detail in the following section.
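The two stages of the advanced search can be sketched as two lookups in Python. This is only an illustration under stated assumptions: the CSA is reduced to a dictionary from EC code to PDB codes, and the homology search is replaced by a precomputed dictionary from PDB code to entry labels. The names `advanced_ec_search`, `TOY_CSA` and `TOY_HOMOLOGUES`, together with every PDB code and entry label, are placeholders invented for the example.

```python
def advanced_ec_search(ec, csa_pdb_by_ec, macie_by_homologue_pdb):
    """Map an EC code to entries via PDB codes, keeping the supporting PDB codes."""
    hits = {}
    # Stage 1: PDB codes annotated with this EC code (stand-in for the CSA).
    for pdb in csa_pdb_by_ec.get(ec, []):
        # Stage 2: entries homologous to that structure (stand-in lookup).
        for entry in macie_by_homologue_pdb.get(pdb, []):
            hits.setdefault(entry, []).append(pdb)
    return hits


# Placeholder data; none of these codes are real.
TOY_CSA = {"3.5.2.6": ["pdb1", "pdb2"], "1.1.1.1": ["pdb3"]}
TOY_HOMOLOGUES = {"pdb1": ["entry_A"], "pdb2": ["entry_A", "entry_D"]}

print(advanced_ec_search("3.5.2.6", TOY_CSA, TOY_HOMOLOGUES))
print(advanced_ec_search("1.1.1.1", TOY_CSA, TOY_HOMOLOGUES))
```

Keeping the PDB codes that led to each entry lets a user see which structures support a suggested match.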
                                                                    The CSA is a database of catalytic residues in proteins of
                                                                 known structure. It contains much less mechanistic informa-
                                                                 tion than MACiE, but has a considerably wider coverage of
                                                                 protein structures than MACiE does. This wider coverage is
Figure 6. Enzyme name search heuristics.
                                                                 partly because the CSA contains not only manually annotated
                                                                 entries, but also entries that are automatically
                                                                 annotated based on sequence alignment to the manual entries.
mainly for accessing the entries from the top level, i.e. for
searching entries in MACiE by EC code, enzyme name,              PDB code. There are over 19 000 crystal structures relating to
etc. The complex searches are all available from the query       enzymes deposited in the PDB. As MACiE entries require
pages of MACiE (http://www.ebi.ac.uk/thornton-srv/databases/     extensive literature searching and analysis, only a small
MACiE/queryMACiE.html) and are mainly for searching for          fraction of these PDB entries are covered explicitly, 202 in
specific mechanisms, mechanism components or residues and         total. However, we have used the CSA to identify homologues
their functions in the reaction steps, although there are some   of these enzymes, extending this coverage to 7528 PDB codes.
overall reaction searches implemented as well. Table 2 lists        Figure 5 details the search performed in MACiE, when a
the searches available in MACiE and the Supplementary Data       protein structure described by a PDB code is entered.
contain a detailed listing of the searches available.            Although the entries returned by this search will be homo-
   The following sections describe searching by EC code,         logues, this does not guarantee that the mechanism and the
PDB code or enzyme name, all of which use heuristics to          catalytic residue assignments are the same. This is because
extend the coverage of MACiE.                                    the homology method (see below) can retrieve very distant
                                                                 relatives. Owing to this limitation, all homologous entries
EC code. The EC code search implemented in MACiE is              are compared by EC code, and when there is a divergence
detailed in Figure 3 and can be accessed at any point in the     between the MACiE entry and the homologue at the serial
scheme shown. The search for current EC numbers will always      number level, this is clearly indicated to the user. We also


       Nucleic Acid Res. 2007, 35, D515-D520.
Nucleic Acids Research, 2007, Vol. 35, Database issue                  D519


list the amino acid residues that are annotated as catalytic in                 the results page we link both to the MACiE entry and the
both MACiE and the CSA. Thus it is clear if there is any                        CSA entry.
difference between EC numbers and catalytic residues. If
the EC number differs but the catalytic residues between                           Homology in MACiE. We have been working to bring
query and homologue are of identical types, it can be inferred                  MACiE and the CSA closer together. This includes using
that the mechanisms are likely to be the same, but where both                   the CSA to determine homologues (those enzymes which
differ, the mechanisms are unlikely to be transferable. From                    are evolutionarily related) of entries in MACiE. The CSA
                                                                                finds homologues using a PSI-BLAST search (with an
                                                                                E-value cut-off of 0.0005 and five iterations) against all
                                                                                sequences currently in the PDB, plus all sequences in a
                                                                                non-redundant subset of UniProt. The UniProt sequences
                                                                                are included purely in order to increase the range of the
                                                                                PSI-BLAST search by bridging gaps between distantly
                                                                                related sequences in the PDB; only sequences occurring in
                                                                                the PDB are retrieved for entry into the CSA. In the CSA,
                                                                                and thus MACiE, homologous entries are only included
                                                                                if the residues which align with the catalytic residues in
                                                                                the parent literature entry are identical in residue type. In




                                                                                other words, there must be no mutations at the catalytic res-
                                                                                idue positions. There are, however, a few exceptions to this
                                                                                rule:
                                                                                  (i) In order to allow for the many active site mutants in the
                                                                                      PDB, one (and only one) catalytic residue per site can be
                                                                                      different in type from the equivalent in the parent
                                                                                      literature entry. This is only permissible if all residue
                                                                                      spacing is identical to that in the parent literature entry,
                                                                                      and there are at least two catalytic residues.
                                                                                 (ii) Sites with only one catalytic residue are permitted to be
                                                                                      mutant provided that the residue number is identical to
                                                                                      that in the parent entry.
                                                                                (iii) Fuzzy matching of residues is permitted within the
Figure 7. Growth of MACiE. This shows the growth in the number of EC                  following groups: [V,L,I], [F,W,Y], [S,T], [D,E], [K,R],
codes (blue), EC sub-sub classes (cyan) and catalytic domain CATH codes               [D,N], [E,Q], [N,Q]. This fuzzy matching cannot be used
(red) in MACiE.                                                                       in combination with rules (i) or (ii) above.
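The acceptance rules for homologous entries listed above can be sketched in code. This is an illustrative reading of rules (i) to (iii), not the CSA implementation: a site is represented here as a list of (one-letter residue type, residue number) pairs already aligned between parent and candidate, and "residue spacing" is taken to mean the differences between consecutive residue numbers. Both the representation and the function names are assumptions.

```python
# Fuzzy-matching groups quoted in rule (iii).
FUZZY_GROUPS = [set("VLI"), set("FWY"), set("ST"), set("DE"),
                set("KR"), set("DN"), set("EQ"), set("NQ")]


def fuzzy_equal(a, b):
    """True if two residue types are identical or share a fuzzy group."""
    return a == b or any(a in g and b in g for g in FUZZY_GROUPS)


def spacing(site):
    """Differences between consecutive residue numbers (assumed meaning)."""
    numbers = [number for _, number in site]
    return [b - a for a, b in zip(numbers, numbers[1:])]


def is_acceptable_homologue(parent, candidate):
    """Apply the default rule and exceptions (i)-(iii) to an aligned site."""
    if not parent or len(parent) != len(candidate):
        return False
    pairs = [(p[0], c[0]) for p, c in zip(parent, candidate)]
    mismatches = sum(1 for p, c in pairs if p != c)
    if mismatches == 0:
        return True                      # default: identical residue types
    if all(fuzzy_equal(p, c) for p, c in pairs):
        return True                      # rule (iii), used on its own
    if len(parent) >= 2:                 # rule (i): one mutant, same spacing
        return mismatches == 1 and spacing(parent) == spacing(candidate)
    return parent[0][1] == candidate[0][1]   # rule (ii): same residue number
```

Because the fuzzy check must succeed for every aligned pair before it is accepted, it is never combined with the single-mutant allowance of rules (i) or (ii).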




Figure 8. Frequency distribution of amino acid residues. This shows the frequency of catalytic amino acid residues in MACiE (blue), versus the frequency of
residues in MACiE (cyan), versus the frequency of residues in the wwPDB (red). The frequency of catalytic amino acid residues in MACiE is calculated by
taking the number of residues (of a given type) annotated in MACiE divided by the total number of annotated residues in MACiE, multiplied by 100.
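The percentage described in the Figure 8 caption (count of a residue type divided by the total number of annotated residues, multiplied by 100) can be written as a short sketch. The input format, a flat list of residue names, and the function name are assumptions made for illustration.

```python
from collections import Counter


def residue_frequencies(residues):
    """Return the percentage frequency of each residue type in a list."""
    counts = Counter(residues)
    total = sum(counts.values())
    # count of one type / total annotated residues * 100
    return {res: 100.0 * n / total for res, n in counts.items()}
```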





Enzyme name. This is currently implemented as a partial            and is also affiliated with Cambridge University Department
string match, thus entering ‘beta’ will return all the             of Chemistry. Funding to pay the Open Access publication
β-lactamases and betaine-aldehyde dehydrogenase. If no             charges for this article was provided by the Wellcome Trust.
results are returned from the partial name search, then the
name search heuristics (shown in Figure 6) are implemented.        Conflict of interest statement. None declared.
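A partial string match of the kind described above can be sketched as a case-insensitive substring test. This is only an illustration: the list of names is hypothetical, and the way MACiE stores and queries enzyme names is not specified here.

```python
def partial_name_search(query, names):
    """Return every name that contains the query, ignoring case."""
    q = query.lower()
    return [name for name in names if q in name.lower()]


# Hypothetical enzyme names, for illustration only.
example_names = [
    "beta-lactamase",
    "betaine-aldehyde dehydrogenase",
    "chorismate mutase",
]
hits = partial_name_search("beta", example_names)
```

An empty result list would be the point at which a fallback, such as the name search heuristics, is tried.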
   This search utilizes the IntEnz database (4). MACiE
searches for a name in IntEnz, either a synonym, alternative
name or common name, and returns the EC code of that               REFERENCES
name. The EC code is then used to search MACiE. If no
matches are found to the sub-subclass level of the EC code,         1. Schomburg,I., Chang,A., Ebeling,C., Gremse,M., Heldt,C., Huhn,G.
the user is offered an advanced EC code search (see Figure 4).         and Schomburg,D. (2004) BRENDA, the enzyme database: updates
                                                                       and major new developments. Nucleic Acids Res., 32,
                                                                       D431–D433.
Statistics                                                          2. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004)
The other major development in MACiE has been the                      The KEGG resource for deciphering the genome. Nucleic Acids Res.,
                                                                       32, D277–D280.
inclusion of database statistics that are all generated on the      3. IUBMB (2005) Recommendations of the Nomenclature Committee of
fly from the SQL tables. A full listing of the statistics               the International Union of Biochemistry and Molecular Biology on the
available can be found in the Supplementary Data. The                  nomenclature and classification of enzyme-catalysed reactions.
growth of MACiE is shown in Figure 7 in terms of EC                 4. Fleischmann,A., Darsow,M., Degtyarenko,K., Fleischmann,W.,




coverage and CATH coverage.                                            Boyce,S., Axelsen,K., Bairoch,A., Schomburg,D., Tipton,K.F. and
                                                                       Apweiler,R. (2004) IntEnz, the integrated relational enzyme database.
   The statistics in MACiE can also be used to examine the             Nucleic Acids Res., 32, D434–D437.
function and distribution of amino acid residues (G.L. Holliday,    5. Holliday,G.L., Bartlett,G.J., Almonacid,D.E., O’Boyle,N.M.,
D.E. Almonacid, J.M. Thornton and J.B.O. Mitchell,                     Murray-Rust,P., Thornton,J.M. and Mitchell,J.B.O. (2005) MACiE: a
manuscript in preparation) (see Figure 8), the distribution of         database of enzyme reaction mechanisms. Bioinformatics, 21,
                                                                       4315–4316.
mechanism and mechanism components and the bond order               6. Holliday,G.L., Mitchell,J.B.O. and Murray-Rust,P. (2004) CMLSnap:
changes occurring in each step of the reaction.                        animated reaction mechanisms. Internet J. Chem., 7, Article 4.
                                                                    7. Holliday,G.L., Murray-Rust,P. and Rzepa,H.S. (2006) Chemical
                                                                       Markup, XML, and the World Wide Web. 6. CMLReact, an
FUTURE DEVELOPMENTS                                                    XML vocabulary for chemical reactions. J. Chem. Inf. Model., 46,
                                                                       145–157.
MACiE is a continually developing resource, and in the              8. Pegg,S.C.-H., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C.,
future we hope to include 3D data, which will incorporate              Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C.
                                                                       (2006) Leveraging enzyme structure–function relationships for
various statistics and searches related to the analysis of             functional inference and experimental design: the Structure–Function
these data. We will also continue to extend the coverage of            Linkage Database. Biochemistry, 45, 2545–2555.
MACiE to include alternative reaction mechanisms that               9. Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism
have been suggested for various enzymes, as well as new                DataBase. Nucleic Acids Res., 33, D407–D412.
                                                                   10. Berman,H.M., Henrick,K. and Nakamura,H. (2003) Announcing
mechanisms. Finally, we intend to build a user interface               the worldwide Protein Data Bank. Nature Struct. Biol.,
that allows chemical diagrams to be drawn and used to                  10, 980.
search MACiE, to develop an entry process that is more             11. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and
usable and also to implement the classification of enzyme               Thornton,J.M. (1997) CATH—a hierarchic classification of protein
mechanisms that we are developing.                                     domain structures. Structure, 5, 1093–1108.
                                                                   12. Martin,A.C. (2004) PDBSprotEC: a Web-accessible database linking
                                                                       PDB chains to EC numbers via SwissProt. Bioinformatics, 20,
                                                                       986–988.
SUPPLEMENTARY DATA                                                 13. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N.,
                                                                       Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data
Supplementary Data are available at NAR Online.                        Bank. Nucleic Acids Res., 28, 235–242.
                                                                   14. Laskowski,R.A., Chistyakov,V.V. and Thornton,J.M. (2005)
                                                                       PDBsum more: new summaries and analyses of the known 3D
ACKNOWLEDGEMENTS                                                       structures of proteins and nucleic acids. Nucleic Acids Res., 33,
                                                                       D266–D268.
We would like to thank the EPSRC (G.L.H. and J.B.O.M.),            15. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site
BBSRC (G.J.B. and J.M.T.—CASE studentship in associa-                  Atlas: a resource of catalytic sites and residues identified in enzymes
                                                                       using structural data. Nucleic Acids Res., 32, D129–D133.
tion with Roche Products Ltd; N.M.O.B. and J.B.O.M.—grant          16. Golovin,A., Oldfield,T.J., Tate,J.G., Velankar,S., Barton,G.J.,
BB/C51320X/1), the Wellcome Trust, EMBL, IBM (G.L.H.                   Boutselakis,H., Dimitropoulos,D., Fillon,J., Hussain,A., Ionides,J.M.
and J.M.T.), the Chilean Government’s Ministerio de                    et al. (2004) E-MSD: an integrated data resource for bioinformatics.
Planificación y Cooperación and the Cambridge Overseas                 Nucleic Acids Res., 32, D211–D216.
                                                                   17. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B.,
Trust (D.E.A.) for funding and Unilever for supporting the             Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005)
Centre for Molecular Science Informatics. J.W.T. is funded             The Universal Protein Resource (UniProt). Nucleic Acids Res., 33,
by a European Molecular Biology Laboratory studentship,                D154–D159.




Part III

QSAR




Vol. 22 no. 20 2006, pages 2565–2566
 BIOINFORMATICS APPLICATIONS NOTE                                                                                      doi:10.1093/bioinformatics/btl416



 Data and text mining

 PYCHEM: a multivariate analysis package for python
 Roger M. Jarvis1,4,*, David Broadhurst1,4, Helen Johnson2, Noel M. O’Boyle3 and
 Royston Goodacre1,4
 1 School of Chemistry, The University of Manchester, PO Box 88, Sackville Street, Manchester M60 1QD, UK,
 2 Faculty of Life Sciences, University of Manchester, Stopford Building, Oxford Road, Manchester M13 9PT, UK,
 3 Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield
 Road, CB2 1EW, UK and 4 Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK
 Received on April 4, 2006; revised on July 5, 2006; accepted on July 26, 2006
 Advance Access publication July 31, 2006
 Associate Editor: Martin Bishop


                                                                                 trait(s); and discriminant analysis, for distinguishing between




 ABSTRACT
 Summary: We have implemented a multivariate statistical analysis                different sample groups and for subsequent predictions on new
 toolbox, with an optional standalone graphical user interface (GUI),            samples. In fact, multivariate analysis encompasses many more
 using the Python scripting language. This is a free and open source             methods than these examples of linear modeling imply (Brereton,
 project that addresses the need for a multivariate analysis toolbox             2003); but these tools are perhaps those most commonly used for the
 in Python. Although the functionality provided does not cover the full          modeling of biological data.
 range of multivariate tools that are available, it has a broad comple-             Many programs currently exist for multivariate analysis. Flexible
 ment of methods that are widely used in the biological sciences. In             environments for mathematical computing are available in the
 contrast to tools like MATLAB, PyChem 2.0.0 is easily accessible                form of MATLAB (The Mathworks, Natick, MA, USA), GNU
 and free, allows for rapid extension using a range of Python modules            Octave (http://www.octave.org/) which aims to be a free equivalent
 and is part of the growing amount of complementary and interoperable            of MATLAB, and R (http://www.r-project.org/), which has many
 scientific software in Python based upon SciPy. One of the                      other bio-analysis modules, such as Vegan (for environmetrics) and
 attractions of PyChem is that it is an open source project and so               Bioconductor (for genomic analysis). These products provide
 there is an opportunity, through collaboration, to increase the scope           powerful tools for multivariate analysis through command line
 of the software and to continually evolve a user-friendly platform that         interpreters, which allow the user to perform their analysis with
 has applicability across a wide range of analytical and post-genomic            a great degree of flexibility. However, they require some investment
 disciplines.                                                                    in time to become familiar with the interpreter's syntax, and are not
 Availability: http://sourceforge.net/projects/pychem                            necessarily straightforward for people with little computa-
 Contact: Roger.Jarvis@manchester.ac.uk or admin@pychem.org.uk                   tional experience. In addition, a number of graphical multivariate
 Supplementary information: Further information is available from the            software tools are also available; Evince (UmBio, Umeå, Sweden),
 project home page at http://pychem.sf.net/ whilst details of data gen-          The Unscrambler (CAMO, Woodbridge, NJ, USA), Pirouette
 eration are available at http://biospec.net/                                    (Infometrix, Bothell, WA, USA), S-Plus (Insightful, Seattle, WA,
                                                                                 USA) and SIMCA (Umetrics, Umeå, Sweden) are all good tools for
 1      INTRODUCTION                                                             basic multivariate analysis although, with the exception of S-Plus,
                                                                                 they lack the flexibility of the interpreter style interfaces.
 Increasingly in the life sciences many experiments generate data                   Thus there is currently a requirement for a flexible, extensible, free
 which are of a multivariate nature, where many observations are                 and open source graphical environment for performing multivariate
 recorded for each sample under analysis. Interpretation of such                 analysis, which can be used by both experts and casual users. The
 complex data cannot generally be performed by taking a univariate               increasing popularity of scripting languages such as Python (http://
 approach, since no single measurement is necessarily adequate                   www.python.org/) within the life sciences community offers the
 enough to describe the problem being addressed. In fact, the                    technology and critical mass for such a project. A platform of this
 application of univariate methodology is in many cases totally inap-            type addresses the requirements outlined above, with the additional
 propriate as the complexity of information contained within large               benefit that it allows for the rapid development of new cross-platform
 biological datasets reflects the complexity of the system(s) being               software approaches, and the integration of currently available soft-
 studied. Typical multivariate analysis problems involve unsuper-                ware libraries through application programming interfaces (APIs).
 vised learning such as factor analysis, for reducing the dimension-
 ality of data and modeling of variance; linear regression, for
                                                                                 2    THE MULTIVARIATE ANALYSIS TOOLBOX
 formulating input to output transformation models based on super-
 vised learning which are predictive generally for quantitative
                                                                                      FOR PYTHON
                                                                                 The PyChem project aims to provide a simple multivariate
 *To whom correspondence should be addressed.                                    analysis toolbox with a powerful and intuitive GUI front-end.



 © 2006 The Author(s)
 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
 by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Bioinformatics. 2006, 22, 2565-2566.



                                                                              stored and the progression of the analysis recorded, which is
                                                                              particularly useful for tracking data analyses as part of GLP. The
                                                                              additional benefit of using the XML data structure for storage is that
                                                                              it introduces the potential for engineering simple bespoke interfaces
                                                                              to database storage systems.
                                                                                  PyChem provides simple grid-style user interfaces for the input of
                                                                              experimental and sample metadata, so producing a series of vectors
                                                                              describing the origin and identity of each sample and measured
                                                                              variable. For unsupervised analyses, such as PCA, the software
                                                                              simply requires a vector, or multiple vectors of sample labels for
                                                                              plotting; in addition for supervised analyses, vectors are required to
                                                                              (1) represent putative class structures or some quantitative trait
                                                                              (e.g., level of abiotic or biotic interference) and (2) identify groups
                                                                              into which the data should be split for the purpose of cross-
                                                                              validation. In supervised analyses the issue of model validation
                                                                              is crucial; when a model is formulated there is a possibility that
Fig. 1. A screenshot demonstrating the feature selection functionality        it will overfit the data and find a relationship between the data and




available in PyChem. In this example, microarray data (Golub et al., 1999)    the target class structure or dependent variables, which does not
consisting of 72 samples represented by 7070 genes have been analysed.        hold for subsequent predictions; i.e. the model has learnt the training
The GA directed search can be used to highlight genes that are particularly   data perfectly and is not able to generalize. This situation can be
important for discrimination.                                                 avoided by performing some form of model validation. In the
                                                                              current version of PyChem (2.0.0) we use the preferred approach
The project is implemented in Python and utilizes the                         of data splitting (Brereton, 2003), which works by dividing the
wxPython (http://www.wxpython.org/), Boa Constructor (http://                 measured X-variables into three groups: a model training set,
boa-constructor.sourceforge.net/) and SciPy (http://scipy.org/)               model cross-validation data and finally an independent test set.
packages (see Fig. 1 for an example screenshot) amongst others.               The model is trained on the first set, optimized on the second set
The software was designed to provide a range of algorithms that               and then tested for accuracy on the third set of ‘hold-out’ data.
address three fundamental questions commonly asked by the                         A major emphasis of this work has been in providing clear
researcher.                                                                   and useful graphical reports for the interpretation of results. The
                                                                              GUI uses wxPyPlot (http://www.cyberus.ca/~g_will/wxPython/
  (1) What is the shape of the data—including sources of variance             wxpyplot.html), with a small modification to include text plotting.
      and outlier identification?                                             In the future even more focus will be given to the structure of graphical
  (2) How similar are different samples?                                      reporting in PyChem, as well as the functionality associated with the
                                                                              plotting canvases. Finally, all results, both graphical and numerical,
  (3) Which measurements from the original data can be attributed
                                                                              can easily be exported from PyChem, with numerical results in ASCII
      to observed differences and/or similarities?
                                                                              file format to allow for use in other software applications.
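The data-splitting approach described above (train on one set, optimize on a second, test on a held-out third) can be sketched with the Python standard library. This is a generic illustration rather than PyChem's own code: the function name, the 50/25/25 proportions and the fixed random seed are all assumptions.

```python
import random


def split_three_ways(n_samples, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle sample indices and split them into training,
    cross-validation and independent test sets.

    The proportions and the seed are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # whatever remains is held out
    return train, validation, test


# Example: 72 samples, as in the dataset shown in Fig. 1.
train, validation, test = split_three_ways(72)
```

Keeping the three index sets disjoint means the test samples never influence model training or optimization.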
   To help answer these questions, the initial release includes
algorithms for the pre-processing of multivariate data (such as               ACKNOWLEDGEMENTS
scaling, baseline correction, filtering and derivatization), principal
                                                                              R.M.J., D.B., H.J., N.M.O.B. and R.G. would like to thank the
components analysis (PCA) (Jolliffe, 1986), partial least squares
                                                                              BBSRC for funding (NMOB; grant BB/C51320X/1). Funding
regression (PLS1) (Martens and Naes, 1989), discriminant function
                                                                              to pay the Open Access publication charges for this article was
analysis (DFA) (Manly, 1994), cluster analysis [using the C clus-
                                                                              provided by the BBSRC.
tering library for Python (http://bonsai.ims.u-tokyo.ac.jp/
~mdehoon/software/cluster/) (Eisen et al., 1998; de Hoon et al.,              Conflict of Interest: none declared.
2004)], and a number of genetic algorithm (GA) based tools for
performing feature selection (Jarvis and Goodacre, 2005), see Fig. 1.         REFERENCES
   The software is able to handle any 2D dataset where each sample
                                                                              Brereton,R. (2003) Chemometrics: data analysis for the laboratory and chemical plant,
is defined by a series of discrete or continuous measurements. Data                 1st edn. Chichester: John Wiley & Sons Ltd.
can be imported from flat ASCII files that use the standard delim-              Eisen,M. et al. (1998) Cluster analysis and display of genome-wide expression patterns.
iters. Typical data of this type include those generated from                      Proc. Natl Acad. Sci. USA, 95, 14863–14868.
microarrays, proteomics, spectroscopic methods (UV-Vis, infrared              de Hoon,M. et al. (2004) Open source clustering software. Bioinformatics, 20,
                                                                                   1453–1454.
and Raman), mass spectrometry, NMR, or indeed any data arrays
                                                                              Golub,T. et al. (1999) Molecular classification of cancer: class discovery and class
representing samples for which multiple discrete measurements                      prediction by gene expression monitoring. Science, 286, 531–537.
have been acquired. Once data have been imported into PyChem                  Jarvis,R. and Goodacre,R. (2005) Genetic algorithm optimization for pre-processing
they can be saved in an XML format [implemented using cElement-                    and variable selection of spectroscopic data. Bioinformatics, 21, 860–868.
Tree (http://effbot.org/)] as a PyChem experiment, which allows for           Jolliffe,I.T. (1986) Principal Component Analysis. Springer-Verlag, New York.
                                                                              Manly,B.F.J. (1994) Multivariate Statistical Methods: A Primer. Chapman  Hall/
the subsequent storage of multiple experimental results within a                   CRC, New York.
single file. This allows for the capture of the state of the system            Martens,H. and Naes,T. (1989) Multivariate Calibration. John Wiley  Sons,
at a point in time, so that results of multivariate analyses can be                Chichester.





  Bioinformatics. 2006, 22, 2565-2566.
Chemistry Central Journal
 Methodology                                                                                                                            Open Access
 Simultaneous feature selection and parameter optimisation using
 an artificial ant colony: case study of melting point prediction
 Noel M O'Boyle*1,2, David S Palmer1,3, Florian Nigsch1 and
 John BO Mitchell1

 Address: 1Unilever Centre for Molecular Science Informatics, Dept. of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW,
 UK, 2Cambridge Crystallographic Data Centre, 12 Union Rd, Cambridge, CB2 1EZ, UK and 3Department of Chemistry, Aarhus University, 8000
 Aarhus C, Denmark
 Email: Noel M O'Boyle* - baoilleach@gmail.com; David S Palmer - dsp@chem.au.dk; Florian Nigsch - fn211@cam.ac.uk;
 John BO Mitchell - jbom1@cam.ac.uk
 * Corresponding author




 Published: 29 October 2008                                                      Received: 1 August 2008
                                                                                 Accepted: 29 October 2008
 Chemistry Central Journal 2008, 2:21   doi:10.1186/1752-153X-2-21
 This article is available from: http://journal.chemistrycentral.com/content/2/1/21
 © 2007 O'Boyle et al
 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
 which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.




                   Abstract
                   Background: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony
                   (WAAC), that performs simultaneous feature selection and model parameter optimisation for the
                   development of predictive quantitative structure-property relationship (QSPR) models. The
                   WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf
                   Model 2005, 45: 1024–1029). We test the ability of the algorithm to develop a predictive partial
                   least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581–590) of melting
                   point values. We also test its ability to perform feature selection on a support vector machine
                   model for the same dataset.
                   Results: Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model
                   with 68 descriptors which has an RMSE on an external test set of 46.6°C and R² of 0.51. The
                   number of components chosen for the model was 49, which was close to optimal for this feature
                   selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C
                   and R² of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R² of 0.47) for the same
                   data and has similar performance to a Random Forest model (RMSE of 44.5°C, R² of 0.55).
                   However it is much less prone to bias at the extremes of the range of melting points as shown by
                   the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest.
                   Conclusion: With a careful choice of objective function, the WAAC algorithm can be used to
                   optimise machine learning and regression models that suffer from overfitting. Where model
                   parameters also need to be tuned, as is the case with support vector machine and partial least
                   squares models, it can optimise these simultaneously. The moving probabilities used by the
                   algorithm are easily interpreted in terms of the best and current models of the ants, and the
                   winnowing procedure promotes the removal of irrelevant descriptors.




Chem. Cent. J. 2008, 2, 21.




Background
Quantitative Structure-Activity and Structure-Property Relationship (QSAR and QSPR) models are based upon the idea, first proposed by Hansch [1], that a molecular property can be related to physicochemical descriptors of the molecule. A QSAR model for prediction must be able to generalise well to give accurate predictions on unseen test data. Although it is true in general that the more descriptors used to build a model, the better the model predicts the training set data, such a model typically has very poor predictive ability when presented with unseen test data, a phenomenon known as overfitting [2]. Feature selection refers to the problem of selecting a subset of the descriptors which can be used to build a model with optimal predictive ability [3]. In addition to better prediction, the identification of relevant descriptors can give insight into the factors affecting the property of interest.

The number of subsets of a set of n descriptors is 2^n - 1. Unless n is small (<20) it is not feasible to test every possible subset, and the number of descriptors calculated by cheminformatics software is usually much larger (CDK [4], MOE [5] and Sybyl [6] can respectively calculate a total of 95, 146 and 248 1D and 2D descriptors). Feature selection methods can be divided into two main classes: the filter approach and the wrapper approach [3,7,8]. The filter approach does not take into account the particular model being used for prediction, but rather attempts to determine a priori which descriptors are likely to contain useful information. Examples of this approach include ranking descriptors by their correlation with the target value or by estimates of the mutual information (based on information theory) between each descriptor and the response. Another commonly used filter in QSAR is the removal of highly correlated (or anti-correlated) descriptors [9]. Liu [10] presents a comparison of five different filters in the context of prediction of binding affinities to thrombin. The filter approach has the advantages of speed and simplicity, but the disadvantage that it does not explicitly consider the performance of the model containing different features. Correlation criteria can only detect linear dependencies between descriptor values and the response, but the best performing QSAR models are often non-linear (support vector machines (SVM), neural networks (NN) and random forests (RF), for example). In addition, Guyon and Elisseeff show that very high correlation (or anti-correlation) does not necessarily imply an absence of feature complementarity, and also that two variables that are useless by themselves can be useful together [3].

The wrapper approach conducts a search for a good feature selection using the induction algorithm as a black box to evaluate subsets and calculate the value of an objective function. The objective function should provide an estimate of how well the model will generalise to unseen data drawn from the same distribution. The purpose of the search is to find the feature selection that optimises this value. The most well-known deterministic wrapper is sequential forward selection [11] (SFS) which involves successive additions of the feature that most improves the objective function to the subset of descriptors already chosen. A related algorithm, sequential backwards elimination [12] (SBE), successively eliminates descriptors starting from the complete set of descriptors. Both of these algorithms suffer from the problem of 'nesting'. In the case of SFS, nesting refers to the fact that once a particular feature is added it cannot be removed at a later stage, even if this would increase the value of the objective function. More sophisticated methods, such as the sequential forward floating selection (SFFS) algorithm of Pudil et al. [13], include a backtracking phase after each addition where variables are successively eliminated if this improves the objective function. Wrapper methods specific to certain models have also been developed. For example, the Recursive Feature Elimination algorithm of Guyon et al. [14] and the Incremental Regularised Risk Minimisation of Fröhlich et al. [15] are specific to models built using support vector machines.

Stochastic wrappers attempt to deal with the size of the search space by incorporating some degree of randomness into the search strategy. The most well known of these algorithms is the genetic algorithm [16] (GA), whose search procedure mimics the biological process of evolution. A number of models are created randomly in the first generation, the best of which (as measured by the objective function) are selected and interbred in some way to create the next generation. A mutation operator is applied to the new models so that random sampling of the local space occurs. Over the course of many generations, the objective function is optimised. Genetic algorithms were first used for feature selection in QSAR by Rogers and Hopfinger [17] and are now used widely [9,18,19]. Other stochastic methods which have been used for feature selection in QSAR are particle swarm optimisation [20,21] and simulated annealing [22].

An additional difficulty in the development of QSAR models is the fact that some regression methods have parameters that need to be optimised to obtain the best performance for a particular problem. The Support Vector Machine (SVM) is an example of such a method. A SVM is a kernel-based machine learning method used for both classification and regression [23-25] which has shown very good performance in QSAR studies [9]. In ε-SVM regression, the algorithm finds a hyperplane in a transformed space of the inputs that has at most ε deviation from the output y values. Deviations greater than ε are penalised by multiplying by a cost value C. The transformation of the inputs is carried out by means of kernel functions, which allows nonlinear relationships between the inputs and the outputs to be handled by this essentially linear method. For a particular problem and kernel, the values of C and ε must be tuned.

Here we describe WAAC, Winnowing Artificial Ant Colony, a stochastic wrapper for feature selection and parameter optimisation that combines simultaneous optimisation of the selected descriptors and the model parameters to create a model with good predictive accuracy. This method does not require any pre-processing of the data apart from removal of zero-variance and duplicate descriptors. The only requirement is that allowed values of parameters of the models must be specified. As a result, this method is suitable for use as an automatic generator of predictive models.

The WAAC algorithm is a novel stochastic wrapper derived from the modified Ant Colony Optimisation (ACO) algorithm of Shen et al. [26]. Ant colony algorithms take their inspiration from the foraging of ants whose cooperative behaviour enables the shortest path between nest and food to be found [27]. Ants deposit a substance called pheromone as they walk, thus forming a pheromone trail. At a branching point, an ant is more likely to choose the trail with the greater amount of pheromone. Over time as pheromones evaporate, only those trails that have been reinforced by the passage of many ants will retain appreciable amounts of pheromone, with the shortest trail having the greatest amount of pheromone. In the end, all of the ants will travel by the shortest trail. Artificial ant colony systems may be used to solve combinatorial optimisation problems by making use of the ideas of cooperation between autonomous agents through global knowledge and positive feedback that are observed in real ant colonies [28].

The first use of artificial ant systems for variable selection in QSAR was the ANTSELECT algorithm of Izrailev and Agrafiotis [29]. The ANTSELECT algorithm involves the movement of a single ant through feature space. Initially equal weights are assigned to each descriptor. The probability of the ant choosing a particular descriptor in the next iteration is the weight for that descriptor divided by the sum of all weights. After the fitness of the model is assessed, all of the weights are reduced by multiplying by (1-ρ), where ρ is the evaporation coefficient. The weights of those descriptors selected in the current iteration are then increased by a constant multiple of the fitness score. Gunturi et al. [30] used a modification of the ANTSELECT algorithm in a recent study of human serum albumin binding affinity in which the number of features selected was fixed a priori and, in addition, could not include descriptors that had a correlation coefficient greater than 0.75.

Since the ANTSELECT algorithm uses only a single ant, it cannot make use of one of the most important features of ant colony algorithms, collective intelligence. Instead, premature convergence will occur due to positive reinforcement of models that have performed well earlier in the local search. In addition, the search space will be poorly covered. Although the authors recommend that the algorithm should be repeated several times to minimise the likelihood of convergence to a poor local minimum, the use of an ant colony is a much more robust solution.

Shen et al. [26] presented an ACO algorithm that differed from ANTSELECT in several ways. Their algorithm, which they called a modified ACO, is similar to our WAAC algorithm in that it involves a colony of ants, each of which remembers its best model and score, as well as its current model and score. In Shen et al.'s algorithm, for every descriptor there are both positive and negative weights. The probability that an ant will choose a particular descriptor is given by the positive weight for that descriptor divided by the sum of the positive and negative weights. After every iteration, the weights are reduced by multiplying by (1-ρ) as for ANTSELECT. The positive weight for a particular descriptor is increased by the sum of the fitness scores of all ants in the current iteration that have selected it, as well as the fitness scores of the best models of all ants that have selected it in that model. Similarly, the negative weight for a particular descriptor is decreased by an amount based on the fitness scores of models that have not selected it.

In the following section, we describe the WAAC algorithm in detail, as well as the dataset and model used to test the algorithm. In the Results and Discussion sections, we describe the performance of the WAAC algorithm, compare it to other models on the same dataset, and discuss some practical considerations in usage.

Methods
WAAC algorithm
The WAAC algorithm uses a population of candidate models termed an 'ant colony'. Each ant represents a model; that is, it is associated with a particular feature selection as well as particular values for the model (for example, SVM) parameters. The set of descriptors is stored as a binary fingerprint of length F (the number of descriptors), where a value of 1 for the nth bit indicates that the nth descriptor is selected, and 0 indicates that it is not. For each parameter of the model, a range of discrete values is required. The parameter values used by a particular ant are stored in a list of length P, where P is the number of adjustable parameters of the model. The fitness of each model is measured using an objective function specified by the user.
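As an illustration only, the model representation described in the WAAC algorithm section (a binary fingerprint of length F plus a list of P parameter values drawn from discrete ranges) could be sketched in Python as follows. The authors' implementation is in R and is not reproduced here; the names `Model`, `random_model` and `selected`, and the descriptor names `d1` to `d4`, are hypothetical. The two parameter grids copy the C and ε ranges that the Models section reports for the SVM model.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Model:
    """One candidate model: a feature selection plus parameter values."""
    fingerprint: List[int]   # length F; bit n is 1 if descriptor n is selected
    params: List[float]      # length P; one value per adjustable parameter


def random_model(n_descriptors: int,
                 param_grid: Sequence[Sequence[float]],
                 rng: random.Random) -> Model:
    """Build a model at a random point in feature and parameter space."""
    # Each bit is 0 or 1 with equal probability.
    fingerprint = [rng.randint(0, 1) for _ in range(n_descriptors)]
    # Each parameter takes one of its allowed discrete values.
    params = [rng.choice(list(values)) for values in param_grid]
    return Model(fingerprint, params)


def selected(model: Model, names: Sequence[str]) -> List[str]:
    """Names of the descriptors whose bit is set in the fingerprint."""
    return [name for name, bit in zip(names, model.fingerprint) if bit == 1]


# Hypothetical descriptor names (assumption, not from the paper).
names = ["d1", "d2", "d3", "d4"]
# Allowed values: C from 1 to 31 in steps of 2, epsilon from 0.01 to 1.61
# in steps of 0.1 (the SVM ranges quoted in the Models section).
param_grid = [
    [float(c) for c in range(1, 32, 2)],
    [round(0.01 + 0.1 * i, 2) for i in range(17)],
]

rng = random.Random(0)
ant = random_model(len(names), param_grid, rng)
print(ant.fingerprint, ant.params, selected(ant, names))
```

Keeping the fingerprint and the parameter list in one object mirrors the statement that each ant is associated with both a feature selection and a set of parameter values; the objective function itself is left out because it is supplied by the user.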




The initial population of ants is randomly placed in feature and parameter space. The bits of the binary fingerprints representing the feature selections are initialised to either 0 or 1 with equal probability, so that on average each ant corresponds to a model based on approximately 50% of the descriptors. Conversely, each descriptor is initially selected by approximately 50% of the ants. The initial parameter values for each ant are chosen at random from the available values for each parameter.

Figure 1 shows a schematic of the WAAC algorithm. After initialisation, the algorithm enters the optimisation phase. For each descriptor, a moving probability is calculated by taking the average of the fraction of ants which have currently selected that descriptor and the fraction that have selected that descriptor in their best model. This moving probability is used to determine the chance that a particular ant will select a particular descriptor in the next iteration. At the start of the optimisation phase, the moving probabilities for all of the descriptors will be approximately equal to 0.5 (since the best model will be the current model and each descriptor is selected by approximately 50% of the ants).

Figure 1. Outline of the WAAC algorithm.

Similarly, for each parameter there is a moving probability associated with every allowed value. These moving probabilities sum to unity (since each ant needs to select exactly one allowed value for each parameter), and are calculated by taking the average of the fraction of ants which have currently selected a particular allowed value and the fraction of ants that have selected that value in their best model. At the start of the optimisation phase, each allowed value of a parameter will be selected by approximately N/P ants where N is the number of ants, and P the number of allowed values.

At the start of the optimisation phase, the ants move more or less randomly, as the moving probabilities are essentially equal for all features and parameter values. However, over the course of the optimisation phase as particular descriptors are found to occur frequently in the best models associated with the ants, due to positive feedback these descriptors will be more likely to be chosen in subsequent iterations. This global optimisation procedure is combined with local optimisation due to the influence of the current positions of the ants on the moving probabilities. Note that the ants do not move about relative to their position in a previous iteration; rather, their subsequent location in feature space is determined by the best and current feature selections of all of the ants. Note that nesting is not a problem, as in each step of the optimisation the ants are free to explore descriptor combinations which did not exist in the previous step.

After multiple iterations of the optimisation algorithm, a winnowing procedure is applied. This reduces the search space by retaining only those descriptors that have been chosen by at least 20% of the ants in their best models, and removing the rest. Parameter values are reinitialised randomly. Some descriptors may be retained that do not improve the models, but the subsequent reinitialisation of the ants on the smaller search space will allow the subsequent optimisation phase to identify better models which exclude that descriptor. Note that no information is carried from one optimisation procedure to the next. In particular, memory of previous best models does not guide future searching. This means that the randomly initialised models in the new optimisation phase are always poorer than the best models of the previous phase, but the reduction in the size of the feature space means that the performance of the model quickly recovers and matches or improves on earlier performance.
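The descriptor moving probabilities and the winnowing step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' R implementation: the function names are hypothetical, fingerprints are plain lists of 0/1 values, and `move_ant` assumes that an ant samples each bit independently with the corresponding moving probability, which is one reading of "the chance that a particular ant will select a particular descriptor".

```python
import random
from typing import List


def moving_probabilities(current: List[List[int]],
                         best: List[List[int]]) -> List[float]:
    """Average of the fraction of ants that currently select each descriptor
    and the fraction that select it in their best model."""
    n_ants = len(current)
    n_desc = len(current[0])
    probs = []
    for j in range(n_desc):
        frac_current = sum(fp[j] for fp in current) / n_ants
        frac_best = sum(fp[j] for fp in best) / n_ants
        probs.append(0.5 * (frac_current + frac_best))
    return probs


def move_ant(probs: List[float], rng: random.Random) -> List[int]:
    """Assumed move rule: select descriptor j with probability probs[j]."""
    return [1 if rng.random() < p else 0 for p in probs]


def winnow(best: List[List[int]], threshold: float = 0.2) -> List[int]:
    """Indices of descriptors chosen by at least `threshold` of the ants
    in their best models; all other descriptors are dropped."""
    n_ants = len(best)
    return [j for j in range(len(best[0]))
            if sum(fp[j] for fp in best) / n_ants >= threshold]


# Toy colony of 5 ants and 3 descriptors (made-up fingerprints).
current = [[1, 0, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0], [1, 0, 1]]
best = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 0, 1], [1, 0, 1]]

probs = moving_probabilities(current, best)
rng = random.Random(1)
next_fingerprint = move_ant(probs, rng)
kept = winnow(best)
print(probs, next_fingerprint, kept)
```

In this toy colony the second descriptor appears in none of the best models, so it falls below the 20% threshold and is winnowed out, while the other two are retained. The moving probabilities for parameter values follow the same averaging rule, applied per allowed value rather than per bit.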

As shown in Figure 1, the optimisation phase and winnowing procedure are repeated until convergence is achieved or a specific number of iterations have occurred. The best model found at any point in the entire optimisation procedure should be chosen as the final best model. An implementation of WAAC in R [31] is available from the authors on request.

Dataset
We use the Karthikeyan dataset [32] of melting point values as described in Nigsch et al. [33]. This is a dataset of melting points of 4119 diverse organic molecules which cover a range of melting points from 14 to 392.5°C, with a mean of 167.3°C and a standard deviation of 66.4°C. Each molecule is described by 203 2D and 3D descriptors, which is the full range of descriptors available in the software MOE 2004.03 [5].

The dataset was randomly divided 2:1 into training data and an external test set (1373 molecules, see additional file 1: externaltest.csv for the original data). The training data was further randomly divided 2:1 into a training set used for model building (1831 molecules, see additional file 2: internaltraining.csv) and an internal test set (915 molecules, see additional file 3: internaltest.csv).

Objective function
The goal of the WAAC algorithm is to find the feature subset and parameter values that will give the best predictive accuracy for a model based on given training data. During the course of the optimisation, the algorithm needs to be guided by an objective function that will give an estimate of the predictive accuracy of a particular model.

Here we examine the performance of the WAAC algorithm on the Karthikeyan dataset using as our objective function the root mean squared error of the predictions on the internal test set, RMSE(int). Each model is built on the training set using whatever features and parameter values have been selected, and then used to predict the melting point values for the internal test set.

Statistical testing
To assess the quality of a model, we report three statistics: the squared correlation coefficient, R², the Root-Mean-Square-Error, RMSE, and the bias. These are defined in Equations 1 to 3. A parenthesis nomenclature is used to indicate whether the statistic refers to a model tested on the entire training data (tr) (this includes the internal test set), the internal test set only (int), or the external test set (ext).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^{2}} \qquad (1)$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^{2}}{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - \bar{y}^{\mathrm{obs}}\right)^{2}} \qquad (2)$$

$$\mathrm{bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right) \qquad (3)$$

In the prediction of the external test set, an outlier is defined as any point with a residual greater than 4 standard deviations from the mean.

Models
We used the WAAC algorithm to simultaneously optimise the chosen features and number of components in a Partial Least Squares (PLS) model. The plsr method in the pls package in R [31] was used to build the PLS model. Scaling was set to true. A range of 20 allowed parameter values for the number of components in the model was initially set to cover from 1 to 191 inclusive in steps of 10. After each winnowing, the step size was reset so that the maximum value for the number of components was less than the number of remaining descriptors. For the WAAC algorithm itself, a colony of 50 ants was used, and the algorithm was run for 800 iterations with winnowing every 100 iterations. For comparison, the algorithm was run for the same length without any winnowing.

In addition, we used the WAAC algorithm to optimise a Support Vector Machine (SVM) model. The svm method in the e1071 package in R [31] was used to perform ε-regression with a radial basis function. A range of allowed parameter values for the SVM were chosen based on a preliminary run: values for C from 1 to 31 inclusive in steps of 2, and values of ε from 0.01 to 1.61 inclusive in steps of 0.1. Since two parameters needed to be optimised for this model, the length of each optimisation phase in the WAAC algorithm was extended to 150 iterations and the algorithm was run for 1500 iterations in total.

To compare to other feature selection methods, we used the training data to build a Random Forest model [34] using the randomForest package in R (using the default settings of mtry = N/3, ntree = 500, nodesize = 5). We also compared to the best of thirteen k Nearest Neighbours (kNN) models trained on the training set, where k was 1, 5, 10 or 15. For the models based on multiple neighbours, separate models were created where the predictions were combined using exponential, geometric, arithmetic, or inverse distance weighting (for more details, see Nigsch et

                                                                                                                                              Page 5 of 15
  Chem. Cent. J. 2008, 2, 21.                                                                                (page number not for citation purposes)
Chemistry Central Journal 2008, 2:21                                     http://journal.chemistrycentral.com/content/2/1/21



al. [33]). The best performing model, as measured by leave-one-out cross validation on the training data, was the 15 NN model with exponential weighting. Hereafter, this model is referred to as the kNN model.

Genetic algorithm
For comparison with the WAAC algorithm, a genetic algorithm for feature selection was implemented in the R statistical programming environment [31]. Fifty chromosomes were randomly initialised so that each chromosome on average corresponded to a model based on half of the descriptors. A selection operator chose 10 chromosomes using tournament selection with tournaments of size 3. Once selected, a chromosome was removed from the pool for further selection. A crossover operator was applied to the selected chromosomes, as a single-point crossover between randomly selected (with replacement) chromosomes yielding a pair of children in each case. Each child was subject to a mutation operator which, for a given bit on a chromosome, had a probability of 0.04 of flipping it. The process of crossover and mutation was repeated until 50 offspring were created. The next generation was then formed by the 25 best chromosomes in the original population along with the best 25 of the offspring.

Results
The WAAC algorithm was used to search parameter and feature space for a predictive model for the Karthikeyan dataset for both a PLS model and an SVM




Figure 2
Value of the objective function for the best model at each iteration of the WAAC algorithm for the PLS model
(top) and the SVM model (bottom). The figures on the right, (b) and (d), show the effect of having a single optimisation
phase without any winnowing. Ten repetitions of the algorithm are shown, with corresponding repetitions starting from the
same initial random seed.
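The genetic algorithm described in the text can be sketched as follows. This is a minimal illustration in Python, not the authors' R implementation: the function names, the bit-list representation of a chromosome, and the convention that a lower objective value is better (as for RMSE(int)) are assumptions. The constants (10 selected chromosomes, tournaments of size 3, mutation probability 0.04, 50 offspring, 25 + 25 survivors) are taken from the description.

```python
import random


def tournament_select(population, scores, n_select=10, size=3, rng=random):
    """Choose n_select chromosomes; each winner is removed from the pool."""
    pool = list(range(len(population)))
    chosen = []
    for _ in range(n_select):
        contenders = rng.sample(pool, size)
        winner = min(contenders, key=lambda i: scores[i])  # lower is better
        pool.remove(winner)
        chosen.append(population[winner])
    return chosen


def crossover(a, b, rng=random):
    """Single-point crossover yielding a pair of children."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate=0.04, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def next_generation(population, objective, n_offspring=50, rng=random):
    """One generation: selection, crossover, mutation, then elitist merge."""
    scores = [objective(c) for c in population]
    parents = tournament_select(population, scores, rng=rng)
    offspring = []
    while len(offspring) < n_offspring:
        # Parents are drawn with replacement from the selected chromosomes.
        a, b = rng.choice(parents), rng.choice(parents)
        for child in crossover(a, b, rng):
            offspring.append(mutate(child, rng=rng))
    offspring = offspring[:n_offspring]
    half = len(population) // 2
    best_old = sorted(population, key=objective)[:half]
    best_new = sorted(offspring, key=objective)[:half]
    return best_old + best_new
```

Because half of each new generation is the best half of the previous one, the best objective value in the population can never get worse from one generation to the next.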







model. Figures 2(a) and 2(c) show the progress of the algorithm for the PLS and SVM models respectively, as measured by the value of the objective function for the best model found so far in a particular optimisation phase. Each experiment was performed 10 times with different random seeds. For each repetition, the model with the lowest value of the objective function was chosen from among the best models found in each optimisation phase. Of these ten models, the one with the fewest descriptors was chosen as the single final model. This reduces the possibility of finding by chance a model which had an optimal value of the objective function but poor predictive ability.

The selected models for WAAC/PLS and WAAC/SVM are shown in Table 1. Of the 203 original descriptors, only 68 were selected for the PLS model, and 28 for the SVM model. The final models were evaluated by training on the entire training data of 2746 molecules, and predicting the melting point value of the external test set. The results are shown in Figure 3 and summarised in Table 2. The summary statistics for the PLS model are: for the training set, RMSE(tr) = 44.4°C, R2(tr) = 0.52, bias = -0.0°C; for the test set, RMSE(ext) = 46.6°C, R2(ext) = 0.51, bias = -0.74°C. For comparison, the value of the objective function RMSE(int) was 42.8°C. There was a single outlier, mol4161 (Figure 4). The summary statistics for the SVM model are: for the training set, RMSE(tr) = 30.7°C, R2(tr) = 0.77, bias = -1.6°C; for the test set, RMSE(ext) = 45.1°C, R2(ext) = 0.54, bias = -2.1°C. The value of the objective function RMSE(int) was 40.2°C. Three molecules were identified as outliers to the model: mol41, mol4161 and mol4195. These are drawn as filled circles in Figure 3, and their structures are shown in Figure 4.




Figure 3
Performance of models developed with WAAC: (a) a PLS model and (b) an SVM model. The first two columns
contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the
right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is
shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals plots. All values in °C.
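The two-stage rule used to pick the single final model (lowest objective value within each repetition, then fewest descriptors across repetitions) can be sketched as below. The `Candidate` type and the function name are hypothetical, and the text does not say how ties were broken; this sketch simply keeps the first candidate encountered.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    rmse_int: float      # objective function value, lower is better
    n_descriptors: int   # size of the selected feature subset


def select_final_model(repetitions: List[List[Candidate]]) -> Candidate:
    """repetitions[i] holds the best model of each optimisation phase of run i."""
    # Stage 1: within each repetition, keep the model with the lowest objective.
    per_repetition = [min(phases, key=lambda m: m.rmse_int) for phases in repetitions]
    # Stage 2: across repetitions, keep the model with the fewest descriptors.
    return min(per_repetition, key=lambda m: m.n_descriptors)
```

Choosing on descriptor count in the second stage, rather than on the objective again, is what guards against a model that scores well on the internal test set by chance.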







Figure 4
Structures of outliers for the models discussed in the text. An outlier is defined as any molecule with a residual greater than four standard deviations from the mean. Molecules 41, 4161 and 4195 are outliers for the WAAC/SVM model; molecules 4161 and 4208 are outliers for both the RF and kNN models; molecule 4161 is the single outlier to the WAAC/PLS model.

For the PLS model the optimised number of components was 49. In order to assess whether the WAAC algorithm sufficiently explored parameter space, we carried out a parameter scan across all allowed values for the parameter with the feature selection found in the best model, and calculated the value of the objective function, RMSE(int). As shown in Figure 5 (solid line), the value of the objective function obtained with 49 components is almost at the minimum, although three larger values for the number of components give slightly better models (42.78°C RMSE(int) versus 42.73°C). For the SVM model, the optimised parameter values associated with the selected model were a cost value of 5, and a value for ε of 0.21. When we carried out a parameter scan across all allowed values of the cost and ε (272 models in total), only one scored better than the selected model, and even then, only marginally: 40.22°C RMSE(int) for cost = 5 and ε = 0.11, versus 40.23°C for the selected model.

Figure 2(a) shows the value of the objective function for the best PLS model at each iteration for the WAAC algorithm compared to a single optimisation phase without any winnowing, Figure 2(b). The same random seeds are used for corresponding repetitions of the experiments, to ensure that the effect observed is not due to different initial models. In the absence of winnowing, premature convergence occurs and poorer solutions are found. This is also the case for the best SVM model shown in Figure 2(c) and 2(d).

The Random Forest (RF) and kNN models for the same data are shown in Figure 6 and Table 2. Although performance on the training set does not give any indication of predictive ability, it is interesting to note how the different models have completely different RMSE(tr) and R2(tr). Performance on the external test set, which was not used to derive any of the models, allows us to assess predictive ability. On the basis of RMSE(ext), the RF model (44.5°C) is as good as, or slightly better than, the WAAC/SVM model (45.1°C), followed by the WAAC/PLS model (46.6°C) and then the kNN model (48.3°C). A similar order of predictive ability is shown by R2(ext), (RF: 0.55, WAAC/SVM: 0.54, WAAC/PLS: 0.51, kNN: 0.47). The bias shows a slightly different order for the two WAAC-derived models (RF: -0.4°C, WAAC/PLS: -0.7°C, WAAC/SVM: -2.1°C, kNN: -4.1°C).

However, looking at the test set predictions in the second column of Figures 3 and 6 it is clear, particularly for the RF model, that a systematic error occurs at the extremes of the melting point values in the dataset: low values are systematically overpredicted, while high values are underpredicted. In order to quantify the extent of this problem, we plotted the test set residuals versus the experimental melting point, and used linear regression to find the line of best fit (shown in the third column in Figures 3 and 6). For a model without this type of predictive bias, the expected slope is 0. The WAAC/SVM model performs best with a slope of -0.43, followed by the kNN and WAAC/PLS models which both have slopes of -0.49, while the RF





Table 1: Description of the best models found by the WAAC algorithm

                              WAAC/PLS                                    WAAC/SVM

 Number of descriptors 68                                                 28

 2D descriptors               petitjean, weinerPath, weinerPol, a_ICM,    radius, weinerPol, b_1rotR, b_rotR, chi1v_c, a_nO, a_nP, balabanJ,
                              b_1rotR, chi0_C, chi1, reactive, a_heavy,   PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA-1, PEOE_VSA-5, PEOE_VSA-6,
                              a_nH, a_nF, a_nO, a_nS, VadjEq, VadjMa,     Q_RPC+, SlogP_VSA1, SlogP_VSA4, SlogP_VSA9, SMR_VSA2, SMR_VSA4,
                              balabanJ, PEOE_RPC+, PEOE_VSA+3,            SMR_VSA6, TPSA
                              PEOE_VSA+4, PEOE_VSA+5,
                              PEOE_VSA+6, PEOE_VSA-1, PEOE_VSA-4,
                              PEOE_VSA_FPNEG, PEOE_VSA_PPOS,
                              PC+, PC-, Q_PC+, Q_RPC+,
                              Q_VSA_FHYD, Q_VSA_FNEG,
                              Q_VSA_FPNEG, Q_VSA_FPOL,
                              Q_VSA_FPOS, Q_VSA_FPPOS,
                              Q_VSA_PNEG, Q_VSA_PPOS, Kier1,
                              Kier3, KierA1, KierA2, apol, vsa_acc,
                              SlogP_VSA3, SlogP_VSA5, SMR_VSA3,
                              SMR_VSA5, TPSA

 3D descriptors               AM1_dipole, AM1_Eele, E_sol, E_strain,      E_oop, E_strain, E_vdw, PM3_LUMO, FASA_P, FCASA+, rgyr
                              E_tor, MNDO_HF, MNDO_dipole,
                              MNDO_E, dipole, PM3_HF, ASA-, ASA_H,
                              CASA-, FASA_H, FASA_P, VSA, glob,
                              std_dim1, std_dim3, vol

 Parameters                   components = 49                             Cost = 5, ε = 0.21
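As a quick consistency check on the parameter ranges quoted in the text (cost from 1 to 31 in steps of 2, ε from 0.01 to 1.61 in steps of 0.1, and an initial PLS component range of 1 to 191 in steps of 10), the grids can be enumerated directly. The variable names are illustrative only; the colony size of 50 ants is the value stated for the PLS study.

```python
# Allowed parameter values as stated in the text.
cost_values = list(range(1, 32, 2))                              # 1, 3, ..., 31
epsilon_values = [round(0.01 + 0.1 * i, 2) for i in range(17)]   # 0.01, 0.11, ..., 1.61
component_values = list(range(1, 192, 10))                       # 1, 11, ..., 191

# Full SVM parameter scan: every (cost, epsilon) combination.
svm_grid = [(c, e) for c in cost_values for e in epsilon_values]

n_ants = 50  # colony size used for the PLS study

print(len(cost_values), len(epsilon_values), len(svm_grid), len(component_values))
```

The 16 cost values and 17 ε values give the 272 models of the parameter scan, and each individual parameter has fewer allowed values than there are ants.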




model has a slope of -0.53. The standard errors of all of these values are 0.01.

Another effect of this systematic error is that the predicted values are bunched closer around the mean than the experimental values. The mean and standard deviation of the experimental values in the test set are 167.3°C and 66.4°C, respectively. All of the model predictions have a similar mean: 166.5, 165.2, 167.0 and 163.2°C for the WAAC/PLS, WAAC/SVM, RF and kNN models respectively. However, for the RF model the standard deviation of the predicted values is much smaller than that of the other models: 47.1, 51.6, 41.0 and 49.5°C for the WAAC/PLS, WAAC/SVM, RF and kNN models respectively.

Another widely used stochastic method for feature selection is a genetic algorithm (GA). Hasegawa et al. [35] were among the first to use a GA in combination with a PLS

Table 2: Summary statistics for the models discussed in the text

                                                            WAAC/PLS            WAAC/SVM            SVM       kNN        Random Forest

 Training set
 RMSE (°C)                                                  44.4                30.7                36.2      47.6       17.8 (44.7)*
 R2                                                         0.52                0.77                0.68      0.44       0.92 (0.51)*
 bias (°C)                                                  0.0                 -1.6                -2.3      -3.4       0.0
 Test set
 RMSE (°C)                                                  46.6                45.1                43.9      48.3       44.5
 R2                                                         0.51                0.54                0.56      0.47       0.55
 bias (°C)                                                  -0.7                -2.1                -2.3      -4.1       -0.4
 mean (°C)                                                  166.5               165.2               165.0     163.2      167.0
 standard deviation (°C)                                    47.1                51.6                49.3      49.5       41.0

 Line of best fit through test set residuals
 Slope                                                      -0.49               -0.43               -0.44     -0.49      -0.53

 * Out-of-bag estimates for RMSE and R2 are shown in parentheses.
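The statistics in Table 2 can be computed along the following lines. This is a sketch under stated assumptions: the paper's Equations 1 to 3 are not reproduced here, so R2 is taken as the squared Pearson correlation and the bias as the mean of (predicted minus experimental). That sign convention agrees with Table 2, where each test-set bias is close to the difference between the predicted mean and the experimental mean of 167.3°C. Function names are hypothetical.

```python
import math
from statistics import mean


def rmse(obs, pred):
    """Root mean squared error of the predictions."""
    return math.sqrt(mean((p - o) ** 2 for o, p in zip(obs, pred)))


def bias(obs, pred):
    """Mean signed error, assumed here to be predicted minus experimental."""
    return mean(p - o for o, p in zip(obs, pred))


def r_squared(obs, pred):
    """Squared Pearson correlation between experimental and predicted values."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


def residual_slope(obs, pred):
    """Least-squares slope of the residuals against the experimental values."""
    residuals = [p - o for o, p in zip(obs, pred)]
    mo, mr = mean(obs), mean(residuals)
    num = sum((o - mo) * (r - mr) for o, r in zip(obs, residuals))
    den = sum((o - mo) ** 2 for o in obs)
    return num / den
```

A model whose predictions are compressed towards the mean can have zero bias and a perfect correlation while still showing a clearly negative residual slope, which is why the slope is reported separately.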





Figure 5
The effect of the number of components on the predictive ability of a PLS model. The red dashed line is a model based on all of the features, whereas the model represented by the blue solid line is based only on the subset selected by the WAAC algorithm. The best subset line ends at 59 components, as there are only 59 features in this subset. The line for all features is truncated at 174 components as the RMSE rapidly increases after this point.

model to perform feature selection. The performance of the GA for feature selection is shown in Figure 7 compared to the WAAC algorithm. For both algorithms, the number of PLS components was fixed at 49. Convergence is much slower for the GA algorithm. In addition, the model with the fewest number of descriptors from 10 repetitions of each algorithm had 95 descriptors in the case of GA/PLS (objective function of 42.6°C) but only 57 for WAAC/PLS (objective function value of 42.3°C).

Discussion
The development of the WAAC algorithm arose from an attempt to overcome the limitations of the modified ACO and ANTSELECT algorithms. Both of these algorithms determine probabilities by summing weights based on fitness scores. However, we observed that as convergence is achieved the fitness scores of the ant models in a particular iteration differ very little from each other. Thus, WAAC uses the fraction of the number of ants that have chosen a particular descriptor rather than a function of the fitness of the ants that have chosen that feature. Another problem with the use of weights is that they increase monotonically over the course of the algorithm whereas the sum of the number of ants has a clear bound. In addition, WAAC uses a value for ρ of 1, that is, complete evaporation. Values less than 1 were found to delay convergence without any corresponding improvement in the result. This makes sense when we consider that the evaporation parameter is supposed to help strike a balance between exploitation of information on previous models (global search) and exploration of local feature space (local search). However, this aspect is already included in Shen et al.'s algorithm and WAAC by the influence of the best models (global search) and current models (local search) on the moving probabilities. As a result of this simpler approach, the moving probabilities now have a meaningful interpretation: the probability of choosing a particular descriptor in the next iteration is equal to the average of the fraction of ants that have chosen that descriptor in their current model and the fraction of ants that have chosen it in their best model.

Since the WAAC algorithm requires a range of allowed parameter values for the model, it is generally worthwhile to do an exploratory run of the algorithm to determine reasonable values. In addition, it is important that the number of allowed values for each parameter is less than the number of ants (preferably much less) to ensure that the parameter space is adequately sampled. An appropriate size for the ant population depends on the number of descriptors and the extent of the interaction between them. Model space will be better sampled if more ants are used, but the calculation time will also increase. However, since the feature-selection space is of size 2^n - 1, where n is the number of descriptors, the exact number of ants is not expected to affect the ability of the algorithm to find solutions. An ant population of between 50 and 100 ants is recommended. For the WAAC/PLS study, the relationship between the population size and the best value of the objective function is shown in Figure 8; there is little improvement beyond 50 ants. The length of the optimisation phase should be sufficient to allow the objective function to start to converge to an optimum value. It is not necessary to allow the optimisation phase to proceed much further, as after this point the descriptors chosen in the best models reinforce themselves and broad sampling of the search space no longer occurs. The winnowing procedure and subsequent reinitialisation on a smaller search space is a more effective way of finding the optimum model.

In the past, the development and comparison of feature selection methods for QSAR have involved the use of a standard dataset first reported in 1990, the Selwood dataset [36] of the activity of 31 antifilarial antimycin analogues, whose structures are represented by 53 calculated physicochemical descriptors. However, comparisons between different algorithms have been hampered by the fact that many of the descriptors are highly-correlated, and in addition, a true test using an external test set is not feasible due to the small number of samples. Advances in computing power mean that it is no longer appropriate to use such a small dataset for the purposes of testing feature






Figure 6
Performance of (a) a kNN model, and (b) a Random Forest model. The first two columns contain predictions for the
training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals
from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers
are shown as filled circles in the test set prediction and residuals columns. All values in °C.
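The outlier criterion given in the caption to Figure 4 (a residual more than four standard deviations from the mean) can be sketched as follows. The function name is hypothetical, and the use of the sample standard deviation of the residuals is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def find_outliers(residuals, n_sd=4.0):
    """Return the indices of residuals more than n_sd standard deviations from the mean."""
    mu = mean(residuals)
    sd = stdev(residuals)  # sample standard deviation (assumed)
    return [i for i, r in enumerate(residuals) if abs(r - mu) > n_sd * sd]
```

Note that with very few data points no value can reach four standard deviations, so the criterion is only meaningful for test sets of reasonable size.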



selection algorithms. The Karthikeyan dataset used here is much more representative of the feature selection problems that occur in modern QSAR and QSPR studies.

PLS models are prone to overfitting. Figure 5 shows a comparison between a PLS model that uses the best subset (as selected by WAAC) and one using all of the descriptors. It is clear that the development of a predictive PLS model requires a variable selection step. Even if the number of components is optimised, performance is significantly poorer if all features are used instead of just the subset selected by the WAAC algorithm. It is also worth noting that PLS is a linear method, whereas SVM is a non-linear method. If the underlying link between descriptor values and the melting point cannot be adequately described by a linear combination of descriptor values, then the performance of the PLS method is likely to suffer. This may explain why, despite containing fewer than half the number of descriptors, the SVM model performed better than the PLS model.

Although the WAAC algorithm is capable of simultaneously optimising the feature selection as well as the parameter values, in some instances it may be preferable to use the WAAC algorithm simply for feature selection and optimise the parameter values separately for each model. This will only be computationally feasible where the model has a small number of parameters which need to be optimised and where the parameter optimisation can be efficiently carried out. For example, the optimal number of components for a PLS model could be determined by internal cross validation. When compared to






Figure 7
Value of the objective function for the best PLS model at each iteration of (a) a genetic algorithm and (b) the
WAAC algorithm. Ten repetitions of each algorithm are shown. The number of PLS components was set to 49.
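For contrast with the genetic algorithm, the Discussion characterises the WAAC moving probability of a descriptor as the average of two fractions: the fraction of ants with that descriptor in their current model and the fraction with it in their best model. A minimal sketch of that interpretation follows; the representation of each model as a set of descriptor names and the function name are assumptions, and the winnowing and parameter-selection steps are not shown.

```python
def moving_probabilities(current_models, best_models):
    """Per-descriptor selection probability for the next iteration.

    Each argument is a list with one set of descriptor names per ant.
    """
    n_ants = len(current_models)
    descriptors = set().union(*current_models, *best_models)
    probabilities = {}
    for d in descriptors:
        frac_current = sum(d in m for m in current_models) / n_ants
        frac_best = sum(d in m for m in best_models) / n_ants
        # Average of the two fractions, so the result is bounded by 0 and 1.
        probabilities[d] = 0.5 * (frac_current + frac_best)
    return probabilities
```

Because both terms are fractions of the colony, the probabilities stay bounded, unlike summed fitness-based weights that grow over the course of a run.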


the use of a genetic algorithm for optimising the feature selection of a PLS model, the WAAC algorithm performs well, both in terms of faster convergence and in its ability to produce models with fewer descriptors. It should be noted, however, that genetic algorithms have many different implementations as well as several parameters. This result, on a single dataset, cannot therefore be seen as conclusive.

Figure 8
Relationship between the population size and the minimum value of the objective function for the WAAC/PLS model. The value of the objective function is the minimum found from ten repetitions of the algorithm.

In comparison to PLS models, the inclusion of a large number of descriptors does not necessarily lead to overfitting for SVM models. Although both Guyon et al. [14] and Fröhlich et al. [15], for example, have developed descriptor selection methods for SVM, an SVM model built on the entire set of descriptors and using the optimized parameters from the WAAC algorithm actually performs slightly better on the external test set. Here, the main effect of the WAAC algorithm is the identification of a minimum subset of descriptors which are the most important for the development of a predictive model. Such a procedure is especially useful when the descriptor values are derived from experimental measurement or require expensive calculation (for example, those derived from QM calculations). It also aids interpretability of the results.

Of the 28 descriptors selected by the WAAC/SVM model, three-quarters are 2D descriptors. Of these, many involve the area of the van der Waals surface associated with particular property values. For example the PEOE_VSA+2 descriptor is the van der Waals surface area (VSA) associated with PEOE (Partial Equalisation of Orbital Electronegativity) charges in the range 0.10 to 0.15. Also selected


                                                                                                                   Page 12 of 15
  Chem. Cent. J. 2008, 2, 21.                                                                 (page number not for citation purposes)
Chemistry Central Journal 2008, 2:21                                        http://journal.chemistrycentral.com/content/2/1/21



were descriptors relating to hydrophobic patches on the VSA (SlogP_VSA1, for example), the
contribution to molar refractivity (SMR_VSA2, for example) which is related to polarisability, and
the polar surface area (TPSA). Since the intermolecular interactions in a crystal lattice are
dependent on complementarity between the properties of the VSA of adjacent molecules, the selection
of these descriptors seems reasonable. Two descriptors were selected relating to the number of
rotatable bonds (b_1rotR and b_rotR). These properties are related to the melting point through
their effect on the change in entropy (ΔSfus) associated with the transformation to the solid state.
Hydrogen bonds make an important energetic contribution to the formation of the crystal structure.
This probably explains the selection of the descriptor for the number of oxygen atoms (a_nO),
although strangely the number of nitrogen atoms is not included (it was however included in five out
of the ten ant models). Four descriptors were selected by all ten ant models: b_1rotR, SlogP_VSA1,
PEOE_VSA-6 and balabanJ. Balaban's J index is a topological index that increases in value as a
molecule becomes more branched [37]. It seems possible that increased branching makes packing more
difficult, and leads to lower melting points.

The WAAC algorithm appears to be robust to the presence of highly correlated descriptors. Despite
the fact that such descriptors were not filtered from the dataset, the selected WAAC/SVM model
contains only two pairs of descriptors with an absolute Pearson correlation coefficient greater than
0.8: b_rotR/b_1rotR (0.97) and SMR_VSA2/PEOE_VSA-5 (0.81). If the WAAC algorithm were unable to
filter highly correlated descriptors, we would expect to see many more correlations as 16 of the
chosen descriptors were highly correlated (absolute value greater than 0.8) with at least one
descriptor not included in the final model. For example, radius has a correlation of 0.86 with
respect to diameter (not unexpectedly). weinerPol is highly correlated with 35 other descriptors,
none of which were chosen in the final model. PM3_LUMO is correlated with both AM1_LUMO (0.97) and
MNDO_LUMO (0.96), but neither of the other two appears.

For a small number of molecules, our models make very poor predictions. This may either be due to a
lack of sufficient training molecules with particular characteristics, or it may be due to a
fundamental deficiency in the information used to build the models. For example, for the WAAC/SVM
models, three outliers can be detected whose residuals are more than four standard deviations from
the mean (Figures 3 and 4). A polyfluorinated amide, mol41, is predicted to have a melting point of
233°C although its experimental melting point is 44°C. The melting points of the other two outliers
were both underestimated: mol4161, m.p. 314.5°C but predicted 119°C, and mol4195, m.p. 342°C, but
predicted 111°C. Both of these molecules have extended conjugated structures, causing the molecule
to be planar over a wide area, and which are likely to give rise to extensive π-π stacking in the
solid state. As a result, they are conformationally less flexible than might be expected from the
number of rotatable bonds. mol4161 is also an outlier to the other three models; for WAAC/PLS it is
the only outlier, whereas the RF and kNN predictions have a second outlier, mol4208 (Figure 4).

The WAAC algorithm described here is particularly useful when a machine learning method is prone to
overfitting if presented with a large number of descriptors, such as is the case with PLS. However,
not all machine learning methods require a prior feature selection procedure. The Random Forest (RF)
method of Breiman uses consensus prediction of multiple decision trees built with subsets of the
data and descriptors to avoid overfitting. For comparison with the WAAC results, we predicted the
melting point values for the external test data using an RF model built on the training data. We
also compared to a 15 Nearest Neighbour model (kNN) where the predictions of the set of neighbours
were combined using an exponential weighting. In our comparison, the RMSE(ext) and R²(ext) show that
the RF and WAAC/SVM models are very similar, and are better than the WAAC/PLS and kNN models.
However, analysis of the residuals shows that the RF is more prone to bias at high and low values of
the melting point compared to the other models.

A predictive bias was observed for all models at the extremes of the range of melting points. A
similar effect was observed by Nigsch et al. for a kNN model of melting point prediction [33]. The
effect was attributed to the fact that the density of points in the training set is less at the
extremes of the range of melting point values. This means that the nearest neighbours to a point
near the extreme are more likely to have melting points closer to the mean. This effect is most
pronounced for the RF model, and the explanation may be similar.

In this study the WAAC algorithm was guided using the RMSE of prediction for an internal test set,
RMSE(int). The choice of which objective function to use should be considered carefully. If an
objective function is chosen which does not explicitly penalise the number of descriptors but only
does so implicitly (for example, RMSE(int)), irrelevant descriptors may accumulate in the converged
model. When using such an objective function, the winnowing procedure implemented in WAAC plays an
important role in removing these descriptors after the optimisation phase by initiating a new search
of a reduced feature space which makes it less likely that irrelevant descriptors will be selected.
This effect is shown in Figure 2(b) and 2(d),





where poorer models were found when the WAAC feature selection and parameter optimisation procedure
was applied without winnowing.

An alternative type of objective function is one that explicitly penalises the number of
descriptors. Such functions typically contain a cost term which is adjusted based on some a priori
knowledge of the number of descriptors desired in the model. For example, the modified ACO algorithm
of Shen et al. [26] was guided by a fitness function with two terms, one relating to the number of
descriptors and the other to the fit of the model to the training set. Objective functions such as
this quickly force models into a reduced feature space by favouring models with fewer descriptors.
However, the moving probabilities used to choose descriptors will be misleading as they will largely
be based on those descriptors present in models with fewer descriptors rather than those with the
best predictive ability. As a result, descriptors with good predictive ability may be removed by
chance. It should be noted that an objective function that simply optimises a measure of fit to the
training data is not a suitable choice for the development of a model with predictive ability.
Optimising the RMSE on the entire training data, RMSE(tr), or optimising the R²(tr) value, will
produce an overfitted model that fits the training data exceptionally well but performs poorly on
unseen data.

Near the end of each optimisation phase, the majority of ants converge to the same feature selection
and parameter values, causing the same model to be repeatedly evaluated. It should be possible to
gain a significant speedup if instead of re-evaluating a model, a cached value were used. Caching
could be simply done by storing the objective function and models for all of the ants from the last
few iterations. This is especially important if an objective function is used whose value varies on
re-evaluation as is the case, for example, with the RMSE from n-fold cross-validation, RMSE(cv).
Since for each ant the best score is retained, the value of the objective function will tend towards
the optimistic tail of the distribution of values of the RMSE(cv). However, it should not have a
major effect on the results of the feature selection and parameter optimisation, as model
re-evaluation generally occurs only once the majority of the ants' models have already converged.

Conclusion
The key elements to developing an effective QSPR model for prediction are accurate data, relevant
descriptors and an appropriate model. Where there is no a priori information available on relevant
descriptors, some form of feature selection needs to be performed.

We have presented WAAC, an extension of the modified ACO algorithm of Shen et al. [26], which can
perform simultaneous optimisation of feature selection and model parameters. In addition, the moving
probabilities used by the algorithm are easily interpreted in terms of the best and current models
of the ants, and our winnowing procedure promotes the removal of irrelevant descriptors.

We have shown that the WAAC algorithm can be used to simultaneously optimise parameter values and
the selected features for PLS and SVM models for melting point prediction. In particular, the
resulting SVM model based on 28 descriptors performed as well as a Random Forest model that used the
entire set of 203 descriptors.

Authors' contributions
NMOB conceived and developed the WAAC algorithm, applied it to the melting point dataset, analysed
the results and drafted the manuscript. DSP was involved in the interpretation of the results and in
revising the manuscript, and carried out the Random Forest calculations. FN implemented the kNN
model. JBOM contributed to the analysis of data and to revising the manuscript. All authors read and
approved the final manuscript.

Additional material

Additional file 1
The external test set. The models were evaluated by testing on this external test set.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S1.csv]

Additional file 2
The internal training set. The WAAC feature selection algorithm was trained on this.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S2.csv]

Additional file 3
The internal test set. The objective function used to guide the WAAC feature selection algorithm was
calculated using this internal test set.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S3.csv]

Acknowledgements
We thank the BBSRC (NMOB and JBOM – grant BB/C51320X/1), Pfizer (DSP and JBOM – through the Pfizer
Institute for Pharmaceutical Materials Science), and Unilever for funding FN and JBOM and for
supporting the Centre for Molecular Science Informatics. NMOB thanks Dr. Jen Ryder, Dr. Daniel
Almonacid and Dr. Avril Coghlan for helpful comments on the manuscript.

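The check for highly correlated descriptors described in the discussion (counting pairs whose absolute Pearson correlation coefficient exceeds 0.8) can be illustrated with a minimal sketch. The function names `pearson` and `correlated_pairs`, the descriptor labels and the toy values are assumptions made for illustration; they are not taken from the WAAC implementation or from the melting point dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(descriptors, threshold=0.8):
    """Return (name_i, name_j, r) for descriptor pairs with |r| > threshold.

    descriptors maps a descriptor name to its values over all molecules.
    """
    pairs = []
    for (name_i, x), (name_j, y) in combinations(descriptors.items(), 2):
        r = pearson(x, y)
        if abs(r) > threshold:
            pairs.append((name_i, name_j, r))
    return pairs


# Toy values (hypothetical): "desc_b" is almost a multiple of "desc_a",
# while "desc_c" is only moderately related to either.
descriptors = {
    "desc_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "desc_b": [2.1, 3.9, 6.05, 7.95, 10.1, 11.9],
    "desc_c": [3.0, -1.0, 2.0, -2.0, 1.0, -3.0],
}

pairs = correlated_pairs(descriptors)
for name_i, name_j, r in pairs:
    print(f"{name_i}/{name_j}: r = {r:.2f}")
```

Applied to the descriptors of a final model, a short list from such a check would be consistent with the observation that few highly correlated pairs survive the selection.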

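The outlier criterion used in the discussion (residuals more than four standard deviations from the mean residual) can be sketched as follows. This is an illustration under stated assumptions: the function name `find_outliers` is hypothetical, the sample standard deviation is assumed (the text does not say which estimator was used), and all molecules except mol41 are invented toy values. The mol41 melting points are those quoted in the text.

```python
from statistics import mean, stdev


def find_outliers(names, observed, predicted, n_sd=4.0):
    """Return (name, residual) where the residual lies more than n_sd
    sample standard deviations from the mean residual."""
    residuals = [p - o for o, p in zip(observed, predicted)]
    centre = mean(residuals)
    spread = stdev(residuals)  # assumption: sample standard deviation
    return [
        (name, res)
        for name, res in zip(names, residuals)
        if abs(res - centre) > n_sd * spread
    ]


# Hypothetical toy molecules whose predictions are off by +/- 5 degrees C
names = [f"toy{i}" for i in range(29)]
observed = [100.0 + 2.0 * i for i in range(29)]
predicted = [obs + (5.0 if i % 2 == 0 else -5.0) for i, obs in enumerate(observed)]

# mol41 as quoted in the text: experimental 44 degrees C, predicted 233 degrees C
names.append("mol41")
observed.append(44.0)
predicted.append(233.0)

outliers = find_outliers(names, observed, predicted)
print(outliers)
```

Note that with very few molecules no residual can reach four standard deviations, so a criterion of this kind is only meaningful for reasonably large test sets.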

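The caching idea suggested in the discussion, reusing a stored objective function value when several ants propose the same model, might look like the sketch below. The class name `CachedObjective`, the toy objective and the parameter name `cost` are hypothetical. For simplicity the cache here is unbounded, whereas the text suggests keeping only the models from the last few iterations.

```python
class CachedObjective:
    """Wrap an objective function so that repeated models reuse a stored score."""

    def __init__(self, objective):
        self.objective = objective
        self.cache = {}
        self.evaluations = 0  # number of real (uncached) evaluations

    def __call__(self, features, parameters):
        # A model is identified by its set of selected features (order does
        # not matter) together with its parameter values.
        key = (frozenset(features), tuple(sorted(parameters.items())))
        if key not in self.cache:
            self.cache[key] = self.objective(features, parameters)
            self.evaluations += 1
        return self.cache[key]


def toy_objective(features, parameters):
    # Hypothetical stand-in for an expensive model build and RMSE calculation
    return 40.0 + 0.5 * len(features) + parameters["cost"]


cached = CachedObjective(toy_objective)
ants = [
    (["a_nO", "b_rotR"], {"cost": 1.0}),
    (["b_rotR", "a_nO"], {"cost": 1.0}),  # same model, different feature order
    (["a_nO", "TPSA", "balabanJ"], {"cost": 1.0}),
]
scores = [cached(features, parameters) for features, parameters in ants]
print(scores, cached.evaluations)
```

With a stochastic objective such as RMSE(cv), a cache of this kind also fixes the score of a given model, which avoids the drift towards optimistic values that repeated re-evaluation would cause.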





References
1.    Hansch C, Maloney PP, Fujita T, Muir RM: Correlation of biological activity of phenoxyacetic
      acids with Hammett substituent constants and partition coefficients. Nature 1962, 194:178-180.
2.    Hawkins DM: The problem of overfitting. J Chem Inf Comput Sci 2004, 44:1-12.
3.    Guyon I, Elisseeff A: An introduction to variable and feature selection. J Mach Learn Res
      2003, 3:1157-1182.
4.    Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development
      Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inf Comput Sci
      2003, 43:493-500.
5.    MOE (Molecular Operating Environment), v2004.03 [http://www.chemcomp.com]. Chemical Computing
      Group Inc., Montreal, Quebec, Canada
6.    SYBYL 7.1, 2006 [http://www.tripos.com]. Tripos Inc., 1699 Hanley Road, St. Louis, MO 63144
7.    John GH, Kohavi R, Pfleger K: Irrelevant features and the subset selection problem. In Machine
      learning, Proceedings of the Eleventh International Conference: 10–13 July 1994; Amherst.
      Edited by: Cohen WW, Hirsh H. Morgan Kaufmann; 1994:121-129.
8.    Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell 1997, 97:273-324.
9.    Dudek AZ, Arodz T, Gálvez J: Computational methods in developing quantitative
      structure-activity relationships (QSAR): a review. Comb Chem High Through Screen 2006,
      9:213-228.
10.   Liu Y: A comparative study on feature selection methods for drug discovery. J Chem Inf Comput
      Sci 2004, 44:1823-1828.
11.   Whitney AW: A direct method of nonparametric measurement selection. IEEE Trans Comput 1971,
      20:1100-1103.
12.   Marill T, Green DM: On the effectiveness of receptors in recognition systems. IEEE Trans
      Inform Theory 1963, 9:11-17.
13.   Pudil P, Novovičová J, Kittler J: Floating search methods in feature selection. Patt Recog
      Lett 1994, 15:1119-1125.
14.   Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using
      support vector machines. Mach Learn 2002, 46:389-422.
15.   Fröhlich H, Wegner JK, Zell A: Towards optimal descriptor subset selection with support vector
      machines in classification and regression. QSAR Comb Sci 2004, 23:311-318.
16.   Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Boston: Kluwer
      Academic Publishers; 1989.
17.   Rogers D, Hopfinger AJ: Application of genetic function approximation to quantitative
      structure-activity relationships and quantitative structure-property relationships. J Chem Inf
      Comput Sci 1994, 34:854-866.
18.   Wegner JK, Zell A: Prediction of aqueous solubility and partition coefficient optimized by a
      genetic algorithm based descriptor selection method. J Chem Inf Comput Sci 2003, 43:1077-1084.
19.   von Homeyer A: Evolutionary Algorithms and their Applications in Chemistry. In Handbook of
      Chemoinformatics Volume 3. Edited by: Gasteiger J. Weinheim: Wiley-VCH; 2003:1239-1280.
20.   Agrafiotis DK, Cedeno W: Feature selection for structure-activity correlation using binary
      particle swarms. J Med Chem 2002, 45:1098-1107.
21.   Lin WQ, Jiang JH, Shen Q, Shen GL, Yu RQ: Optimized block-wise variable combination by
      particle swarm optimization for partial least squares modeling in quantitative
      structure-activity relationship studies. J Chem Inf Model 2005, 45:486-493.
22.   Guha R, Jurs PC: Development of linear, ensemble and nonlinear models for the prediction and
      interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci
      2004, 44:2179-2189.
23.   Vapnik VN: The nature of statistical learning theory. New York: Springer Verlag; 1995.
24.   Hastie T, Tibshirani R, Friedman J: The elements of statistical learning: data mining,
      inference, and prediction. New York: Springer; 2001.
25.   Smola AJ, Schölkopf B: A tutorial on support vector regression. Stat Comput 2004, 14:199-222.
26.   Shen Q, Jiang JH, Tao JC, Shen GL, Yu RQ: Modified Ant Colony Optimization Algorithm for
      Variable Selection in QSAR Modeling: QSAR Studies of Cyclooxygenase Inhibitors. J Chem Inf
      Model 2005, 45:1024-1029.
27.   Goss S, Aron S, Deneubourg JL, Pasteels JM: Self-organized shortcuts in the Argentine ant.
      Naturwissenschaften 1989, 76:579-581.
28.   Dorigo M, Di Caro G, Gambardella LM: Ant algorithms for discrete optimization. Artif Life
      1999, 5:137-172.
29.   Izrailev S, Agrafiotis DK: Variable selection for QSAR by artificial ant colony systems. SAR
      QSAR Environ Res 2002, 13:417-423.
30.   Gunturi SB, Narayanan R, Khandelwal A: In silico ADME modelling 2: Computational models to
      predict human serum albumin binding affinity using ant colony systems. Bioorg Med Chem 2006,
      14:4118-4129.
31.   R: A Language and Environment for Statistical Computing 2006 [http://www.R-project.org]. R
      Foundation for Statistical Computing, Vienna, Austria
32.   Karthikeyan M, Glen RC, Bender A: General melting point prediction based on a diverse compound
      data set and artificial neural networks. J Chem Inf Model 2005, 45:581-590.
33.   Nigsch F, Bender A, van Buuren B, Tissen J, Nigsch E, Mitchell JBO: Melting point prediction
      employing k-nearest neighbour algorithms and genetic parameter optimization. J Chem Inf Model
      2006, 46:2412-2422.
34.   Breiman L: Random Forests. Mach Learn 2001, 45:5-32.
35.   Hasegawa K, Miyashita Y, Funatsu K: GA Strategy for Variable Selection in QSAR Studies:
      GA-Based PLS Analysis of Calcium Channel Antagonists. J Chem Inf Comput Sci 1997, 37:306-310.
36.   Selwood DL, Livingstone DJ, Comley JCW, O'Dowd AB, Hudson AT, Jackson P, Jandu KS, Rose VS,
      Stables JN: Structure-activity relationships of antifilarial antimycin analogues: a
      multivariate pattern recognition study. J Med Chem 1990, 33:136-142.
37.   Balaban AT: Highly discriminating distance-based topological index. Chem Phys Lett 1982,
      89:399-404.




Part IV

The Rest




BMC Bioinformatics                                                                                                                       BioMed Central



 Software                                                                                                                               Open Access
 Userscripts for the Life Sciences
 Egon L Willighagen*1, Noel M O'Boyle2, Harini Gopalakrishnan3,
 Dazhi Jiao3, Rajarshi Guha3, Christoph Steinbeck4 and David J Wild3

 Address: 1Cologne University Bioinformatics Center, Cologne University, Cologne, Germany, 2Cambridge Crystallographic Data Centre,
 Cambridge, UK, 3School of Informatics, Indiana University, Bloomington, USA and 4Wilhelm-Schickard-Institut, Center for Bioinformatics,
 University of Tübingen, Tübingen, Germany
 Email: Egon L Willighagen* - egonw@users.sf.net; Noel M O'Boyle - baoilleach@gmail.com; Harini Gopalakrishnan - hgopalak@indiana.edu;
 Dazhi Jiao - djiao@indiana.edu; Rajarshi Guha - rguha@indiana.edu; Christoph Steinbeck - c.steinbeck@steinbeck-molecular.de;
 David J Wild - djwild@indiana.edu
 * Corresponding author




 Published: 21 December 2007                                                    Received: 31 August 2007
                                                                                Accepted: 21 December 2007
 BMC Bioinformatics 2007, 8:487   doi:10.1186/1471-2105-8-487
 This article is available from: http://www.biomedcentral.com/1471-2105/8/487
 © 2007 Willighagen et al; licensee BioMed Central Ltd.
 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
 which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.




                  Abstract
                  Background: The web has seen an explosion of chemistry and biology related resources in the
                  last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available
                  with a wide variety of types of information. There is a huge need to aggregate and organise this
                  information. However, the sheer number of resources makes it unrealistic to link them all in a
                  centralised manner. Instead, search engines to find information in those resources flourish, and
                  formal languages like Resource Description Framework and Web Ontology Language are
                  increasingly used to allow linking of resources. A recent development is the use of userscripts to
                  change the appearance of web pages, by on-the-fly modification of the web content. This opens
                  possibilities to aggregate information and computational results from different web resources into
                  the web page of one of those resources.
                  Results: Several userscripts are presented that enrich biology and chemistry related web
                  resources by incorporating or linking to other computational or data sources on the web. The
                  scripts make use of Greasemonkey-like plugins for web browsers and are written in JavaScript.
                  Information from third-party resources is extracted using open Application Programming
                  Interfaces, while common Uniform Resource Locator schemes are used to make deep links to
                  related information in that external resource. The userscripts presented here use a variety of
                  techniques and resources, and show the potential of such scripts.
                  Conclusion: This paper discusses a number of userscripts that aggregate information from two or
                  more web resources. Examples are shown that enrich web pages with information from other
                  resources, and show how information from web pages can be used to link to, search, and process
                  information in other resources. Due to the nature of userscripts, scientists are able to select those
                  scripts they find useful on a daily basis, as the scripts run directly in their own web browser rather
                  than on the web server. This flexibility allows the scientists to tune the features of web resources
                  to optimise their productivity.




                                                                                                                                        Page 1 of 12
BMC Bioinformatics. 2007, 8, 487.                                                                                 (page number not for citation purposes)
BMC Bioinformatics 2007, 8:487                                               http://www.biomedcentral.com/1471-2105/8/487




Background
The web has seen an explosion of chemistry and biology related resources in the last 15 years:
thousands of scientific journals, databases, wikis, blogs, and regular HTML pages are available
containing information relevant to chemists and biologists [1-4]. While each of those resources is
valuable in itself, integrating information from these resources increases the value even more: for
example, PubChem provides a wealth of data but could be complemented with 3D models to create an
even richer information source.

The original goal of the world wide web was to hyperlink individual web pages allowing humans to
explore a web of knowledge. For individual web pages these links can be created manually, as is
still done in blogs, wikis, and static HTML pages; for large databases this is, however, not
feasible. Userscripts are small programs that can alter the HTML content rendered by web browsers.
For example, a userscript may add book prices from competitors to the Amazon.com website, or may
remove unwanted advertisements from a site. Using the same approach, userscripts can also solve the
problem of interlinking web resources, by adding to web pages of one resource dynamically generated
hyperlinks into another. By selecting a specific set of userscripts, the user can tune a website to
provide all kinds of facilities not anticipated by the original author of the site. For example,
userscripts have been used in bioinformatics to enhance the iHOP web page [5]: the script extracts
user assigned tags from a third party resource, and shows them as a tag cloud on iHOP pages for
particular genes.

Automatic hyperlinking is only possible through the use of unique identifiers such as the PDB ID,
the CAS registration number and, more recently, the IUPAC International Chemical Identifier (InChI).
While identifiers are easily used to connect databases, as is done in the SRS system [6] or in meta
database software like BioWarehouse [7], the sheer number of web resources makes it impossible to
integrate all resources. Consequently, (bio)chemical search engines, such as ChemSpider [8], and
tools to harvest information from web resources, such as ChemXtreme [9] and BioSpider [10,11], as
well as systems that standardize algorithmic access to resources and services, such as BioMOBY [12],
have emerged.

Another reason why identifiers do not always allow linking resources is that many of them are
database specific, such as the PDB ID and the Digital Object Identifier (DOI), and sometimes even
restricted in use, as with the CAS registration number. Open standard identifiers address this
problem. Such identifiers can be derived from ontologies, dictionaries, encyclopedias, or computed
by an algorithm. The Gene Ontology terms are often used as identifiers [13], indicating that a
specific database entry is related to the cited term in the ontology, and therefore related to
entries from other databases annotated with that term.

Identifiers that can be calculated algorithmically are even better, because they do not need to be
looked up in a list of identifiers. Instead, anyone can calculate them from the object itself. For
example, for molecular structures the InChI [14] is the ideal replacement for database specific
identifiers such as the CAS registration number, the PubChem compound identifier and the ChEBI
identifier. These all require a look-up or conversion table to convert one identifier into another.
Using the InChI, one can look up information in all databases without having to know the database
specific identifier.

In addition to the unique identifier, one additional functionality is needed to create a link to a
particular database: the database must provide either an API (Application Programming Interface)
which can be queried using the identifier or else provide a uniform scheme for deep linking to a web
page containing information about the entry behind the identifier. For example, looking up
structures in PubChem is done with a scheme in which the InChI is embedded verbatim. To look up the
structure of methane (InChI=1/CH4/h1H4), the URL
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pccompound&term=%22InChI=1/CH4/h1H4%22[InChI]
is used.

The plethora of resources is overwhelming, and both users and database developers may prefer a
subset of resources, for example those they trust more. It is therefore worthwhile to have a system
that allows users to choose which resources they want to have linked with which other resources.
Userscripts provide the necessary technology to allow this within web browsers. Here we describe
several userscripts we have developed to create links between web resources of interest to
researchers in the life sciences.

Implementation
We use the following techniques to link various web resources in this paper: userscripts, unique
identifiers, microformats, and web resource interfaces. The following sections describe how these
are used in this work.

Userscripts
A userscript is a small program written in JavaScript that is automatically run within a web browser
(often by a plugin or add-on) when the user accesses pages that match a particular URL. Userscripts
allow the user to modify the HTML content of a web page on-the-fly, by adding or removing elements
or by moving them around. For example, userscripts exist that remove pop-up advertisements


BMC Bioinformatics 2007, 8:487                                                    http://www.biomedcentral.com/1471-2105/8/487



from web pages, and that alter the Amazon.com web page to provide book prices from alternative suppliers. A repository of userscripts exists at userscripts.org. Chemists and biologists can find relevant userscripts by searching with the terms 'chemistry' or 'biology'.

Of the popular web browsers, only Opera provides built-in support for userscripts (referred to as 'User JavaScript'). To enable userscript support in other browsers, a third-party extension needs to be installed: Greasemonkey [15] for Firefox, Creammonkey [16] for Safari, IE7pro [17] or Turnabout [18] for Internet Explorer. The userscripts presented in the Results section are targeted at Greasemonkey, although it should be possible to run them in any browser with only minor changes.

The web browser user has full control over which userscripts she wants to have installed, allowing her to customise web pages exactly the way she wishes. Once installed, scripts can be individually enabled or disabled. For example, for Greasemonkey see the 'Manage User Scripts' option in the 'Tools' menu under 'Greasemonkey', or to disable the extension completely, click on the Greasemonkey icon in the status bar. Further control is provided by specifying to which web pages the script applies. Userscripts define default rules (e.g. http://www.biomedcentral.com/), but the user is normally able to override these.

The userscript has two main methods to find the HTML content to which to add or remove elements. The most accurate one is to analyse the document object model (DOM). This approach is used by the Sechemtic userscript to find uses of chemical microformats (see example below under Microformats). The other method is to use regular expressions to find certain strings in the text of the web page. This works particularly well for identifiers with a unique and well described syntax. For example, a regular expression for InChIs will have fewer false positives than one for PDB identifiers.

As with any program that you run on your computer, it is important to consider security when installing userscripts. Although the security model used by Greasemonkey prevents attacks by malicious websites, it is unable to detect or prevent the user from installing a malicious userscript. Such scripts do exist; recently, malicious userscripts were uploaded to Userscripts.org that attempted to steal information from users' cookies. In that case, once the problem was discovered the malicious userscripts were easily detected and removed by the administrator. We recommend that unless you are familiar with JavaScript and carefully inspect the source code, you should only install userscripts from a trusted source.

Unique identifiers
Recognition of biologically and chemically relevant information on web pages is simplified by using identifiers [19]. Such identifiers may or may not be marked up with semantic markup such as microformats (see below). Identifiers are widely used to make connections between databases, and often identify a specific entry in a database. Some examples of this are the PDB identifier, Digital Object Identifiers, PubChem compound identifier, and the CAS registry number for, respectively, the PDB, DOI, PubChem, and the Chemical Abstracts Service databases. In this study we use DOIs, InChIs, and PDB identifiers as our unique identifiers (Table 1).

Table 1: Userscripts for the life sciences. A summary of the resources and identifiers used by userscripts for the life sciences. The
Identification method indicates how the userscript recognises relevant information on a web page. The Identifiers column describes
the unique identifier searched for. The Resources column indicates the web resource to which a link is created, or from which data is
extracted.

                                                                   Technologies and Resources used

 Name                            Identification method              Identifiers                        Resources

 Jmol4PubChem                    HTML tags on PubChem               PubChem ID                         Pub3D [36]
 OSCAR3 on HTML                  natural language processing        chemical structure name            -
 PDB-Jmol                        regular expression                 PDB ID                             First Glance in Jmol [41]
 Sechemtic                       microformats                       InChI, SMILES, CAS number          PubChem [32]
                                                                                                       eMolecules [43]
                                                                                                       Google [54]
 Add quotes to DOIs              regular expression                 DOI                                Postgenomic [3]
                                                                                                       Chemical blogspace [4]
 Add quotes to molecules         microformats                       InChI                              Chemical blogspace [4]
 Add to Connotea                 regular expression                 DOI                                Connotea [30]
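
To illustrate the regular-expression method listed in Table 1, and why some identifiers are easier to recognise than others, the following sketch scans a sentence for InChIs and for PDB IDs. Both patterns and the findAll helper are assumptions made for this example and are not taken from the userscripts themselves; the PDB pattern simply encodes "one digit followed by three letters or digits".

```javascript
// Hypothetical patterns for illustration; the published userscripts may differ.
// An InChI carries a distinctive "InChI=1/" prefix, so accidental matches are rare.
var INCHI_PATTERN = /InChI=1\/[^\s"<]+/g;
// A PDB ID is one digit followed by three letters or digits, a shape that
// many ordinary tokens (years, for instance) also have.
var PDB_ID_PATTERN = /\b[0-9][A-Za-z0-9]{3}\b/g;

// Return every match of a global pattern in the text (an empty array if none).
function findAll(pattern, text) {
  return text.match(pattern) || [];
}

var sample = 'Methane has InChI=1/CH4/h1H4 and was discussed alongside 1abe in 2007.';
var inchis = findAll(INCHI_PATTERN, sample);   // the one genuine InChI
var pdbIds = findAll(PDB_ID_PATTERN, sample);  // "1abe" plus the false positive "2007"
console.log(inchis, pdbIds);
```

In this sample the year is reported as a PDB ID, which is the kind of false positive that makes extra constraints advisable for short identifiers.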







Microformats
Microformats [20] are a lightweight specification that extends HTML to add semantic markup to web pages. For example, hCard is a microformat that allows semantic mark up of address information [21], and hCalendar is a microformat specification for the representation of calendar information about events [22].

A microformat specification has also been suggested for chemistry that would make it much easier to recognise compound names, InChIs, SMILES and CAS registry numbers. Userscripts, or indeed any other programs, would then no longer need to depend on regular expressions to find names and identifiers, but could use this markup to accurately extract the identifier.

For example, a web page implementing the InChI microformat would wrap any InChIs in an HTML <span> element with a @class attribute as follows: <span class="inchi">InChI=1/</span>. This information can easily be extracted using the document.evaluate method which takes an XPath [23] expression (//span[@class="inchi"] in this case):

allInChIs = document.evaluate(
     '//span[@class="inchi"]', document, null,
     XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
     null
);

This code returns all HTML nodes that mark up InChI strings using the InChI microformat. By iterating over these nodes, the userscript can insert new HTML elements, such as links to external resources as shown here in code taken from the Sechemtic userscript:

for (var i=0; i<allInChIs.snapshotLength; i++){
      spanElement = allInChIs.snapshotItem(i);
      inchi = spanElement.innerHTML;
      // create a link to PubChem
      newElement = document.createElement('a');
      newElement.href = "http://www.ncbi.nlm." +
        "nih.gov/entrez/query.fcgi?CMD=search" +
        "&DB=pccompound&term=%22" + inchi +
        "%22[InChI]";
      newElement.innerHTML =
        "<sup>PubChem</sup>";
      spanElement.parentNode.insertBefore(
        newElement, spanElement.nextSibling
      );
}

Web resource interfaces
Web databases are the primary source of information used by the discussed userscripts. While it is easy to have scripts create links to external web resources, it is also possible for them to retrieve information from those resources and include it in the HTML content of the web page the user is browsing. The latter is, for example, performed by the userscript that adds comments from Postgenomic.com and Chemical blogspace to journal web pages.

Userscripts retrieve information from external web resources over HTTP, just as the web browser itself does. To simplify the process, userscripts tend to use a combination of XMLHttpRequest, possibly via the Greasemonkey GM_xmlhttpRequest wrapper method, and the JavaScript Object Notation (JSON) format [24] for data representation. The XMLHttpRequest method retrieves the information using a URL that normally points to a data interface, or API. The Postgenomic.com software has such an API that returns the blog posts that discuss a particular article, as identified by its DOI. Chemical blogspace uses the same API, and adds another one to return blog posts that discuss a particular molecule, as identified by its InChI. Both database APIs can return the information as JSON objects, which is how they are used in the discussed userscripts.

Since our userscripts rely on a particular API or specially-constructed URL to access an external resource, they will fail if the external resource changes its API or the URL it provides to access it. This will not affect the browsing experience of the user, but the additional functionality provided by the userscript will no longer be available. To deal with this, each of the userscripts described in this article checks once a day for a new version and prompts the user to install it if one is available. This means that when a userscript is updated to deal with a new API or URL, every user will quickly have access to the latest version.
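
The once-a-day update check could be organised along the following lines. This is a sketch under stated assumptions, not the code used in the userscripts: the function names are hypothetical, version numbers are assumed to be dotted numeric strings, and the time of the last check is assumed to be kept between page loads (for example with Greasemonkey's GM_getValue and GM_setValue).

```javascript
var ONE_DAY_MS = 24 * 60 * 60 * 1000;

// True when at least one day has passed since the last recorded check.
function isCheckDue(lastCheckMs, nowMs) {
  return nowMs - lastCheckMs >= ONE_DAY_MS;
}

// Compare dotted numeric versions part by part; missing parts count as 0.
function isNewer(remoteVersion, installedVersion) {
  var remote = remoteVersion.split('.').map(Number);
  var local = installedVersion.split('.').map(Number);
  var length = Math.max(remote.length, local.length);
  for (var i = 0; i < length; i++) {
    var r = remote[i] || 0;
    var l = local[i] || 0;
    if (r !== l) {
      return r > l;
    }
  }
  return false;
}

// Decide whether to prompt the user, given a version string that has
// already been fetched from wherever the script is published.
function shouldPromptUpdate(lastCheckMs, nowMs, remoteVersion, installedVersion) {
  return isCheckDue(lastCheckMs, nowMs) &&
         isNewer(remoteVersion, installedVersion);
}
```

In a real script the remote version would only be fetched once isCheckDue returns true, so that at most one extra request is made per day.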






Results
This paper introduces userscripts that have been written in our research groups as exemplars of how web resources can be integrated and to outline how they can be used in research. Our userscripts can be classified into two broad areas: those that link chemical and biological data to websites, and those that affect how we interact with the scientific literature.

In the following sections, we describe in detail how functionality is added to the web page being browsed. Table 1 summarises the resources linked to, or accessed, by each script, as well as the unique identifier used.

Interacting with the scientific literature
OSCAR3 running on HTML
Published journal articles and other web documents with chemistry content are not normally marked up by the publishers or authors to provide machine readable representations of chemical structures and related information. As a result, there has been active interest in methods which can mark documents up automatically. In particular, OSCAR3 [25,26], developed at the Unilever Centre for Molecular Informatics at the University of Cambridge, and used by the Royal Society of Chemistry in their Project Prospect [27], searches documents for chemical names, spectra, and other chemical information, and automatically marks up the content using XML tags (where possible, it also generates machine readable SMILES and InChI structures for chemicals referenced in the document).

We have created a userscript, ChemGM.user.js, that will automatically run OSCAR on a web page and provide inline hypertext links to PubChem for chemical structure names that are found in the page (including 2D structure depictions generated by another web service and PubChem searches). The userscript can be run on any web page, but it is particularly applicable to online journal articles and chemistry blogs. An example highlighting the effect of this userscript is shown in Figure 1. Note that though the images use an article from Chemistry Central




Figure 1: Highlighting and annotating chemical terms in an online journal article. Screenshots showing the effect of the ChemGM.user.js userscript on the Chemistry Central Journal web page (full URL: [47]) for Majumder et al. [48]. (a) When the userscript is running, a toolbar is added to the top of every webpage. Clicking the 'highlight' button in the toolbar causes the contents of the webpage to be analysed for chemical terms. (b) The original text of the abstract. (c) After a minute or so, any chemical terms recognised are highlighted in yellow, and are annotated with hypertext links to their entries in PubChem (if available) and a 2D depiction of the structure.







Journal, the script can be applied to any web page, irrespective of its source or content.

Add quotes from Chemical blogspace and Postgenomic to DOIs
It can be a challenge to keep up with the primary literature in a field. At the same time, there are a large number of scientific blogs, many of which have reviews of the recent literature or highlight interesting papers. The Postgenomic web site was developed by Euan Adie and later hosted by Nature Publishing Group and currently aggregates information from over 750 scientific blogs [3]. The source code is open and has been used by one of the authors (ELW) to establish a similar site, Chemical blogspace, for over 140 blogs with chemical content [4]. Both of these sites identify references to journal articles in blogs, and make this information available through an API. Compared to the Postgenomic website, the Chemical blogspace site also identifies molecules referenced in blogs either by microformat markup of InChI and SMILES, or by analysing links to Wikipedia [28]. If the latter link points to a wiki page that contains a PubChem compound identifier or an InChI, then the molecular structure is linked to the blog post.

This userscript uses the aggregated information collected by Postgenomic and Chemical blogspace. It runs whenever the user accesses the website of a journal publisher. It identifies any DOIs on the page, and uses the Chemical blogspace and Postgenomic APIs to find out whether those DOIs have been referenced in a blog post. If so, an icon is added to the web page next to the DOI which, if hovered over with the mouse, causes a popup to appear containing the name of the citing blog post, the blog name, and the first few lines of text of the blog. The full content can be accessed by clicking on the title of the blog post. In this way, content from blog articles widely dispersed across the web is brought directly to where it is likely to be of most interest: the journal web site. Figure 2 shows the effect of this userscript when running on the HTML version of Spjuth et al. [29].

Providing reviews of journal articles is only one of the uses of such a userscript. It is also a general way to create a link between the content of a blog post and a particular paper. In this way, bloggers can use blog posts to enhance the original journal website without any intervention required by the publisher. For example, the author of a paper may write a blog post which provides additional supporting information for a journal article or includes the article preprint for those who do not have a subscription. Alternatively, the author of a paper may write a blog post and include the DOIs of all of the references. This




Figure 2: Adding information to DOIs on journal web pages. Screenshots from the BMC Bioinformatics web page (full URL: [49]) for Spjuth et al. [29] (a) without any userscript enabled, and (b) showing the effect of the two userscripts 'Add quotes from Chemical blogspace and Postgenomic to DOIs' and 'Add to Connotea'. The latter added a Connotea logo (a 'c' surrounded by linking arrows), which links to the Connotea dialog box for adding this paper to your library, and a number indicating how many people have already bookmarked this paper, which links to the existing entry for this paper on Connotea. The 'Add quotes' userscript added the Cb logo, which links to the Chemical blogspace page for this paper, and a Pg logo, linking to the Postgenomic page. The popup titled 'Powered by Postgenomic.com' (only partially shown) appears when the mouse is placed on the Pg logo, and contains quotes from and links to the citing blog articles.







would not only promote his/her own paper (all of the cited papers would show a blog comment pointing to the citing paper), but would result in an eventual network of citations which could be used to measure the impact of a paper.

Add to Connotea
Connotea is a social bookmarking site developed by Nature Publishing Group for scientists [30]. It allows a user to bookmark websites using either the DOI or a URL, and to tag those bookmarks. Crucially, it also provides an API for retrieving information.

The 'Add to Connotea' userscript has two aspects. Firstly, it makes it easy to add papers to Connotea from journal webpages, by adding a hyperlink in the form of the Connotea logo next to every DOI identified on a journal web page. Clicking on the logo brings the user to the Connotea page for adding new papers. This aspect of the userscript is not entirely novel. A userscript has previously been developed which allows the user to add papers to Connotea from NCBI PubMed [31]. In addition, a small number of publishers (including BioMed Central and Nature Publishing Group) provide a facility to add papers to Connotea directly from their website. Our userscript differs in that it will work on the website of any journal publisher where the text contains DOIs.

The second aspect of this userscript is more interesting in the context of this paper. The userscript queries the Connotea API to find out how many people have previously added this paper to their Connotea account. It then adds this number next to the Connotea icon. Clicking on the number brings you to the Connotea page for that paper. From here it is possible to access comments on the paper. More useful, perhaps, is the ability to find related papers by looking at the other papers a particular Connotea user has tagged with the same tag. Figure 2 shows the effect of this userscript when running on the HTML version of Spjuth et al. [29].

This aspect of the userscript has the potential to affect the way we read the literature. The number of times a particular paper has been bookmarked on Connotea can be considered a measure of its importance or its interest. In the past, measures such as the number of citations have served this purpose, but this information is generally not shown on journal web pages as it is not freely available. Another effect of this userscript is to link the paper the user is viewing to related papers through the Connotea website. If a researcher finds that a particular paper has been bookmarked on Connotea and is of interest to him or her, he or she is likely to find other relevant papers by browsing through the other papers bookmarked by the same Connotea user with the same tag.

On a technical note, this userscript illustrates some techniques necessary for accessing an API that requires a user name and password and that, in addition, only permits one API request every two seconds or so. Note that this userscript requires the user to have a Connotea account (which is freely available at Ref. [30]).

Linking to chemical and biological data sources
Enhancement of PubChem with 3D structures
The PubChem repository is a public collection of over 10 million compounds [32]. The database contains 2D structures as well as a number of precomputed properties (such as number of heavy atoms and topological polar surface area [33]). The web interface to this database allows a wide variety of queries. The results are usually represented in the form of a summary web page containing images of the 2D structures of all the compounds satisfying the query with links to pages for individual compounds which provide a summary of the properties of the compound. In many cases it would be useful to be able to view an image of the 3D structure of a molecule. However, PubChem currently does not contain 3D structures for the compounds stored in the database.

To address this problem, we developed a database of 3D structures of PubChem compounds as part of our web service infrastructure for chemoinformatics [34]. The structures were generated using a two-step process in which the SMILES were converted to a set of rough 3D coordinates using stochastic proximity embedding [35] and subsequently geometry optimised using the MMFF94 force field, using in-house code. A number of compounds were excluded from the final 3D database since the force field did not contain parameters for certain atom types. However, the 3D database, known as Pub3D [36], contains approximately 99% of the compounds in PubChem. Pub3D is wrapped by a set of web services which encapsulate common queries including finding a structure by compound ID (CID) or finding structures matching a SMARTS pattern.

Using this web service interface we created a userscript called 3DStructureView.user.js that allows 3D structures from our database to be shown when users visit the PubChem website (see Figure 3). The script is designed to work only on the summary and detail pages that a user views after a PubChem search. It parses the page and identifies the compound ID which is then used to construct a call to the Pub3D database. The return value is a string containing the 3D structure of the compound, in SD format, which is used to construct an appropriate URL. The result of this process is that the user can now click on a link titled '3DView(Jmol)', which will cause a Jmol applet [37-39] window to appear showing the 3D structure of the compound in question.
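
The step from a compound ID to a Pub3D request can be sketched as follows. Everything specific here is an assumption for illustration: the text does not give the address or parameters of the Pub3D web service, so PUB3D_BASE is a placeholder, and reading the CID from a cid= query parameter is only one plausible way of identifying the compound on a PubChem page.

```javascript
// Placeholder endpoint: the real Pub3D service address is not given in the text.
var PUB3D_BASE = 'http://example.org/pub3d/structure';

// Extract a numeric compound ID from a URL containing a "cid=" query parameter
// (an assumed page format). Returns null when no CID is present.
function extractCid(url) {
  var match = /[?&]cid=(\d+)/i.exec(url);
  return match ? parseInt(match[1], 10) : null;
}

// Build the (hypothetical) request URL for the 3D structure of one compound.
function buildPub3dUrl(cid) {
  return PUB3D_BASE + '?cid=' + encodeURIComponent(String(cid));
}

var cid = extractCid('http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244');
var requestUrl = buildPub3dUrl(cid);
console.log(cid, requestUrl);
```

A userscript would then fetch requestUrl and pass the returned SD-format structure on to the viewer link.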






Figure 3: Adding 3D models to PubChem. Screenshot of the PubChem web page for aspirin (full URL: [50]) with the 3DStructureView userscript enabled. The userscript added the first line of text in the compound summary information. Clicking on the '3DView(Jmol)' link causes a window to pop up showing a 3D model of the structure. Clicking on the 'SDF Format' link allows the user to download the calculated 3D structure of the molecule in SDF file format.


As an example, after installing the script, one can navigate to the PubChem website [32] and search for entries related to aspirin. This should return slightly more than thirty hits. If one then clicks on the compound ID for the first hit, one is taken to a summary page which provides various details regarding the molecular structure and biological activity of aspirin. In addition to the data provided by PubChem, the userscript has enhanced the page to add two links: '3DView(Jmol)' and 'SDF Format'. The former link will bring up an instance of the Jmol applet showing a 3D structure of aspirin, while the second link allows one to download the 3D structure in the SD file format (see Figure 3).

PDB-Jmol Greasemonkey Script
The Protein Data Bank [40] is a repository of experimentally-determined 3D coordinates of proteins. Each entry has a PDB ID, which is a unique four-character identification code consisting of a digit followed by three characters which can be either letters or digits; for example, 1abe, 114L and 6NN9. The PDB-Jmol Greasemonkey userscript identifies all PDB IDs on web pages and adds hyperlinks





to the FirstGlance in Jmol web page [41] for that protein. This website uses the Open Source molecular viewer Jmol to show the protein as a 3D model which can be manipulated by the user. In this way, the user can instantly view the 3D structure of any PDB ID mentioned on a website, and in particular, if the user is reading the HTML version of a journal article on-line, all PDB IDs in the paper will similarly be enhanced. Figure 4 shows an example of the latter case where PDB identifiers in the online version of Mardia et al. [42] have been identified and links added.

As this userscript runs on all web pages accessed by the user, and since the search term is simply 4 characters long, additional constraints are necessary to prevent excessive false positive identification. The userscript only looks for PDB IDs if it finds one of the following terms in the web page: 'protein', 'PDB', or 'enzyme'.

Sechemtic
Sechemtic is a small userscript that detects use of microformats (see Implementation) to mark up molecular identifiers, as well as regular molecular names. It recognises markup for the IUPAC InChI and SMILES, and creates links for those molecules to web resources like eMolecules [43], PubChem [32] and a link to Google to search for more information (see Figure 5). It should be noted that, in particular with Google searches, the links based on InChIs are more useful as the same molecule may be represented by several different SMILES strings but only a single InChI.

From a technological point of view, these scripts are very simple in nature; the semantic nature of the (chemical) microformats is what makes this simple script possible. The semantic markup in HTML for InChIs that is picked up by the userscript looks like <span class="inchi">InChI=1/CH4/h1H4</span> while the markup for a SMILES string looks like <span class="smiles">CCO</span>.

Add quotes from Chemical blogspace to molecules
This userscript, quite similar to the one that adds comments to DOIs, runs on all web pages accessed by the user. Using the same method as the Sechemtic userscript (see above), it identifies any molecules referenced on a page which have been marked up with the appropriate tags. It also supports the (non-marked up) InChI tags on PubChem. It then uses the Chemical blogspace API to find out whether this molecule has been referenced in a blog post. The remainder is as for the previous userscript; an icon is added which contains a popup to the citing blog post. Figure 6 shows the effect of the userscript on the PubChem page for methane (InChI=1/CH4/h1H4). A full




Figure 4
The effect of the PDB-Jmol userscript. Screenshots from the BMC Bioinformatics web page (full URL: [51]) for Mardia et al. [42] showing a paragraph containing PDB identifiers (a) without the PDB-Jmol userscript installed, and (b) with the PDB-Jmol userscript installed. The Jmol text in yellow is a hyperlink to the FirstGlance in Jmol page [41] for a particular protein structure.
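
The keyword guard that the PDB-Jmol userscript applies before searching for four-character PDB IDs can be sketched in a few lines. This is a minimal illustration in Python rather than the JavaScript of the userscript itself. The function name find_pdb_ids, the case-insensitive keyword match and the extra rule that skips all-digit tokens are assumptions of this sketch, not details taken from the userscript.

```python
import re

# Classic PDB identifiers are four characters: a digit 1-9 followed by
# three letters or digits.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

# Trigger terms named in the text; matching them case-insensitively is an
# assumption of this sketch.
KEYWORDS = ("protein", "pdb", "enzyme")


def find_pdb_ids(page_text):
    """Return candidate PDB IDs, or an empty list if no trigger term occurs."""
    lowered = page_text.lower()
    if not any(keyword in lowered for keyword in KEYWORDS):
        return []  # page does not look biochemical: skip the search entirely
    found = []
    for match in PDB_ID.finditer(page_text):
        candidate = match.group(0).upper()
        # Hypothetical extra filter: ignore all-digit tokens such as years.
        if candidate.isdigit():
            continue
        if candidate not in found:
            found.append(candidate)
    return found


print(find_pdb_ids("The protein structure 1TIM was solved long before 2007."))
print(find_pdb_ids("Flight 1ABC departs at noon."))
```

The second call returns an empty list because none of the trigger terms is present, even though 1ABC has the shape of a PDB ID. This is the trade-off the text describes: some recall is given up to avoid false positives on unrelated pages.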








Figure 5
Annotating chemical terms marked up with microformats. Screenshots showing a blog post (full URL: [52]) containing chemical terms marked up with chemical microformats, (a) without and (b) with the Sechemtic userscript enabled. The added hyperlinks allow the user to look up the structure in Google, ChemSpider and PubChem.
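
The extraction step behind Sechemtic, finding spans whose class marks them as an InChI or a SMILES string, can be illustrated as follows. This is a sketch in Python, not the userscript's JavaScript. It assumes the markup is a well-formed XHTML fragment with spans of class "inchi" or "smiles", and it uses the limited XPath subset of the standard library's ElementTree instead of a browser's XPath engine. The function names and the Google search URL pattern are choices made for this illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

MICROFORMAT_CLASSES = ("inchi", "smiles")


def extract_identifiers(xhtml):
    """Return (kind, identifier) pairs for every marked-up span in the fragment.

    Results are grouped by kind, not returned in document order.
    """
    root = ET.fromstring(xhtml)
    found = []
    for kind in MICROFORMAT_CLASSES:
        # Matches spans whose class attribute is exactly the microformat name.
        for span in root.findall(f".//span[@class='{kind}']"):
            found.append((kind, (span.text or "").strip()))
    return found


def google_link(identifier):
    """Build a Google search URL for an identifier (percent-encoded)."""
    return "https://www.google.com/search?q=" + quote(identifier, safe="")


page = ('<p>Methane <span class="inchi">InChI=1/CH4/h1H4</span> and '
        'ethanol <span class="smiles">CCO</span>.</p>')
for kind, identifier in extract_identifiers(page):
    print(kind, identifier, google_link(identifier))
```

Because the class attribute carries the meaning, no pattern matching on the identifier itself is needed. That is why microformats suit identifiers such as SMILES, whose syntax is hard to recognise reliably in running text.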



list of molecules with comments in Chemical blogspace is available from Ref. [44].

A possible use of this script is to link all discussions of a particular drug in the blogosphere to a static page containing information on the drug. Another use is to link discussions on syntheses of molecules to pages containing references to the molecule.

Discussion
Here we have focused on the development of userscripts that enhance web pages for biologists and chemists. If all of these userscripts are installed, any web page with a PDB code will now contain a link to view the structure in 3D, journal webpages will show chemical structure markup and blog comments on articles, 3D structures and links to appropriate blog posts will be available from PubChem, and any use of chemical microformats will be picked up, adding links to Google, eMolecules, ChemSpider and PubChem.

These examples show that userscripts offer a powerful technology to improve the way we read the scientific literature and access (bio)chemical databases. This is done by dynamically combining web resources, and enriching the information content of the primary resources. Theoretically, such links can be made on the web server itself, and this is commonly done, but it does not give the user the flexibility to choose which features to install. The crucial point about userscripts is that they do not require the involvement of the web site provider. All of the enhancements are done on-the-fly by the user's browser.






Figure 6
Adding comments from the blogosphere to molecules. Screenshots from the PubChem web page for methane (full URL: [53]), (a) without and (b) with the Add quotes from Chemical blogspace to molecules userscript enabled. The InChI InChI=1/CH4/h1H4 is identified by the userscript, which then adds the Cb logo. The logo is a link to the Chemical blogspace page for this molecule. The popup titled Powered by Chemical blogspace (only partially shown) appears when the mouse is placed on the Cb logo, and contains quotes from and links to blog posts that discuss this molecule.
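
The InChI in Figure 6 appears as plain text rather than inside a microformat, so it has to be recognised by its syntax. One way to do this is a regular expression, sketched below in Python. The pattern, the clean-up of trailing punctuation and the function name find_inchis are assumptions made for illustration; the expression used by the userscript itself is not given in the text.

```python
import re

# An InChI starts with "InChI=1/" (or "InChI=1S/" for standard InChIs) and
# runs until whitespace, a quote or an angle bracket.
INCHI = re.compile(r"InChI=1S?/[^\s\"'<>]+")


def find_inchis(text):
    """Return the distinct InChI strings found in text, in order of appearance."""
    found = []
    for match in INCHI.finditer(text):
        inchi = match.group(0).rstrip(".,;")
        # Drop closing brackets that were never opened inside the InChI,
        # as in "(InChI=1/CH4/h1H4)".
        while inchi.endswith(")") and inchi.count(")") > inchi.count("("):
            inchi = inchi[:-1].rstrip(".,;")
        if inchi not in found:
            found.append(inchi)
    return found


print(find_inchis("the PubChem page for methane (InChI=1/CH4/h1H4)."))
```

The fixed "InChI=" prefix is what makes a high-precision pattern straightforward here, in contrast to the four-character PDB ID.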


The userscripts combine a number of technologies for data retrieval and communication. Information from HTML pages is extracted using identifiers, regular expressions, XPath queries and microformats. It is noted that the syntax of (bio)chemical and other identifiers is generally not distinct enough to detect them with perfect recall and optimal precision. It is easiest to write regular expressions for the DOI and the InChI with a high precision, compared to, for example, the PDB ID which has a syntax which can clash with other web page content.

Microformats offer a solution for such less well-defined identifiers. This technology is used to wrap identifiers with some semantic markup so that the userscript can easily extract the identifiers using XPath queries. However, microformats do not incorporate a mechanism to provide details on what a microformat means. That is, microformats are not backed up by a specified ontology. As a result the chemical 'smiles' microformat, to mark up SMILES, may collide with a microformat specification to mark up moods.

Once the identifier is extracted by whatever means, the userscripts can either create links to other web resources, or query those resources and embed results into the HTML of the web page on which the userscript is run. While any HTTP-based approach can be used for this, the example userscripts show that combining XMLHttpRequest with JSON [24] is a rather straightforward approach.

Conclusion
We have shown that userscripts are a simple and useful way of integrating bio- and chemoinformatics web resources. In particular, they permit (a) the augmentation of existing websites with functionality not envisioned or indeed wanted by the original author, (b) the integration of information from different domains, and (c) a connection point between the social web (wikis, blogs etc.) and traditional web tools and sites. We continue to find interesting uses for userscripts, and we hope this manuscript will spur others to do likewise.

Availability and requirements
• Project name: Userscripts for Chemistry and Biology
• Project home page: Blue Obelisk [45] website [46]. Download link: http://blueobelisk.sf.net/wiki/Userscripts
• Operating system(s): Platform independent
• Programming language: JavaScript
• Other requirements: Firefox with Greasemonkey add-on (or equivalent) for userscript support; Java is required to view the Jmol applet; a Connotea account is required for the Add to Connotea userscript
• License: GNU GPL, BSD
• Any restrictions to use by non-academics: none








Authors' contributions
NMOB, ELW, HG and DJ have written userscripts mentioned in this text. RG developed and maintains the 3D structure database and contributed to the development of the Pub3D userscript. DW and CS devised and tested some of the userscripts. All authors have read and approved the final manuscript.

Acknowledgements
Pedro Beltrão is acknowledged for laying the foundations of the userscript that adds blog comments to journal web pages. We thank the anonymous reviewers for their constructive comments.

References
1. Galperin MY: The Molecular Biology Database Collection: 2007 update. Nucleic Acids Res 2007, 35:D3-D4.
2. Fox JA, McMillan S, Ouellette BFF: A compilation of molecular biology web servers: 2006 update on the Bioinformatics Links Directory. Nucleic Acids Res 2006, 34:W3-W5.
3. Postgenomic [http://postgenomic.com/]
4. Chemical blogspace [http://cb.openmolecules.net/]
5. Good BM, Kawas EA, Kuo BY, Wilkinson MD: iHOPerator: User-scripting a personalized bioinformatics Web, starting with the iHOP website. BMC Bioinformatics 2006, 7:534.
6. Etzold T, Ulyanov A, Argos P: SRS: information retrieval system for molecular biology data banks. Method Enzymol 1996, 266:114-128.
7. Lee T, Pouliot Y, Wagner V, Gupta P, Calvert DS, Tenenbaum J, Karp P: BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics 2006, 7:170.
8. ChemSpider [http://chemspider.com/]
9. Karthikeyan M, Krishnan S, Pandey AK, Bender A: Harvesting Chemical Information from the Internet Using a Distributed Approach: ChemXtreme. J Chem Inf Model 2006, 46:452-461.
10. BioSpider [http://biospider.ca/]
11. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart D: BioSpider: A Web Server for Automating Metabolome Annotations. Pacific Symp Biocomp 2007, 12:145-156.
12. Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Brief Bioinform 2002, 3(4):331-341.
13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
14. IUPAC International Chemical Identifier (InChI) [http://www.iupac.org/inchi/]
15. Greasemonkey [http://www.greasespot.net/]
16. Creammonkey [http://creammonkey.sourceforge.net/]
17. IE7pro [http://www.ie7pro.com/]
18. Turnabout [http://www.reifysoft.com/turnabout.php]
19. Coles SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the chemical semantic web through the use of InChI identifiers. Org Biomol Chem 2005, 3(10):1832-1834.
20. Microformats [http://microformats.org/]
21. hCard Microformat [http://microformats.org/wiki/hcard]
22. hCalendar Microformat [http://microformats.org/wiki/hcalendar]
23. XML Path Language (XPath) 2.0 – W3C Recommendation [http://www.w3.org/TR/2007/REC-xpath20-20070123/]
24. JSON [http://json.org/]
25. Townsend JA, Adams SE, Waudby CA, de Souza VK, Goodman JM, Murray-Rust P: Chemical documents: machine understanding and automated information extraction. Org Biomol Chem 2004, 2:3294-3300.
26. Corbett P, Murray-Rust P: High-throughput identification of chemistry in life science texts. In Computational Life Sciences II Volume 4216. Edited by: Berthold MR, Glen R, Fischer I. Berlin/Heidelberg: Springer-Verlag; 2006:107-118.
27. RSC Project Prospect [http://www.rsc.org/Publishing/Journals/ProjectProspect/]
28. Wikipedia [http://www.wikipedia.org/]
29. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: An open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
30. Connotea [http://www.connotea.org/]
31. pubmed2connotea 2006 [http://lindenb.integragen.org/pubmed2connotea/].
32. PubChem [http://pubchem.ncbi.nlm.nih.gov/]
33. Ertl P, Rohde B, Selzer P: Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment Based Contributions and Its Application to the Prediction of Drug Transport Properties. J Med Chem 2000, 43:3714-3717.
34. Dong X, Gilbert KE, Guha R, Heiland R, Kim J, Pierce ME, Fox GC, Wild DJ: Web Service Infrastructure for Chemoinformatics. J Chem Inf Model 2007, 47:1303-1307.
35. Agrafiotis DK: Stochastic Proximity Embedding. J Comp Chem 2003, 24:1215-1221.
36. Pub3D [http://rguha.ath.cx/~rguha/cicc/p3d/]
37. Jmol: an open-source Java viewer for chemical structures in 3D [http://www.jmol.org]
38. Willighagen E, Howard M: Fast and Scriptable Molecular Graphics in Web Browsers without Java3D. Available from Nature Precedings 2007 [http://dx.doi.org/10.1038/npre.2007.50.1].
39. Herráez A: Biomolecules in the computer: Jmol to the rescue. Biochem Mol Biol Edu 2006, 34:255-261.
40. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.
41. FirstGlance in Jmol [http://firstglance.jmol.org/]
42. Mardia KV, Nyirongo VB, Green PJ, Gold ND, Westhead DR: Bayesian refinement of protein functional site matching. BMC Bioinformatics 2007, 8:257.
43. eMolecules [http://emolecules.com/]
44. Chemical blogspace – Chemical Compounds [http://cb.openmolecules.net/inchis.php]
45. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The Blue Obelisk-interoperability in chemical informatics. J Chem Inf Model 2006, 46:991-998.
46. Blue Obelisk Userscripts [http://blueobelisk.sourceforge.net/wiki/Userscripts]
47. Chemistry Central Journal web page for Majumder et al [http://journal.chemistrycentral.com/content/1/1/10]
48. Majumder AB, Shah S, Gupta MN: Enantioselective transacetylation of (R,S)-β-citronellol by propanol rinsed immobilized Rhizomucor miehei lipase. Chem Cent J 2007, 1:10.
49. BMC Bioinformatics web page for Spjuth et al [http://www.biomedcentral.com/1471-2105/8/59]
50. PubChem page for aspirin [http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244]
51. BMC Bioinformatics web page for Mardia et al [http://www.biomedcentral.com/1471-2105/8/257]
52. Counting constitutional isomers from the molecular formula [http://chem-bla-ics.blogspot.com/2006/12/counting-stereoisomers-from-molecular_17.html]
53. PubChem page for methane [http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=297]
54. Google [http://google.com/]


O’Boyle et al. Journal of Cheminformatics 2011, 3:8
   http://www.jcheminf.com/content/3/1/8




    SOFTWARE                                                                                                                                        Open Access

   Confab - Systematic generation of diverse low-energy conformers
   Noel M O’Boyle1,2*, Tim Vandermeersch2, Christopher J Flynn1, Anita R Maguire1 and Geoffrey R Hutchison2,3


     Abstract
     Background: Many computational chemistry analyses require the generation of conformers, either on-the-fly, or in
     advance. We present Confab, an open source command-line application for the systematic generation of low-
     energy conformers according to a diversity criterion.
     Results: Confab generates conformations using the ‘torsion driving approach’ which involves iterating
     systematically through a set of allowed torsion angles for each rotatable bond. Energy is assessed using the
     MMFF94 forcefield. Diversity is measured using the heavy-atom root-mean-square deviation (RMSD) relative to
     conformers already stored. We investigated the recovery of crystal structures for a dataset of 1000 ligands from the
     Protein Data Bank with fewer than 1 million conformations. Confab can recover 97% of the molecules to within 1.5
     Å at a diversity level of 1.5 Å and an energy cutoff of 50 kcal/mol.
     Conclusions: Confab is available from http://confab.googlecode.com.


Introduction
The generation of molecular conformations is an essential part of many computational analyses in chemistry, particularly in the field of computational drug design. Methods such as 3D QSAR, protein-ligand docking and pharmacophore generation and searching [1] all require the generation of conformers, whether on-the-fly (as part of the method) or pre-generated by a stand-alone conformer generator. In contrast to 3D structure generators (such as CORINA [2], DG-AMMOS [3] and smi23d [4]), which focus on the generation of a single low-energy conformation, conformation generators create an ensemble of conformers that cover the entire space of low-energy conformations or that part of conformational space occupied by biologically-relevant conformers.

Several proprietary conformation generators are currently available (including OMEGA [5], ROTATE [6], Catalyst [7], Confort [8], ConfGen [9], Balloon [10] and MED-3DMC [11] among others) but only recently have open source conformation generators appeared: Frog2 [12] generates conformers using a Monte Carlo approach, while Multiconf-DOCK [13] adapts the systematic search code from DOCK5 [14] to generate diverse conformers via a torsion-driving approach.

Confab 1.0 is the first release of Confab, an open source conformation generator whose goal is the systematic coverage of conformational space. Accuracy has been favoured over the introduction of approximations to improve performance. The algorithm starts with an input 3D structure which, after some initialisation steps, is used to generate multiple conformers which are filtered on-the-fly to identify diverse low energy conformers. Conformations are generated using the torsion-driving approach from a set of predefined allowed torsion angles. Ring conformations are not currently sampled.

The first section of the paper describes the algorithm used by the software and some implementation details. After this, two applications of the software are described: an analysis of the conformational space of a dataset of 1000 molecules (which includes a comparison to Multiconf-DOCK), and an investigation of the conformational preferences of a particular phenyl sulfone.

Methods
Algorithm
The Confab algorithm is outlined in Figure 1. The input required is a 3D structure with reasonable bond lengths and angles. Since the algorithm does not currently

* Correspondence: baoilleach@gmail.com
1 Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland
Full list of author information is available at the end of the article

                                            © 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative
                                            Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
                                            reproduction in any medium, provided the original work is properly cited.




J. Cheminf. 2011, 3, 8.




explore ring conformations, any rings present should be in reasonable conformations.

The first step of the algorithm is the identification of rotatable bonds. These are defined as all acyclic single bonds where both atoms of the bond are connected to at least two non-hydrogen atoms, but neither atom of the bond is sp-hybridised. Note that this definition excludes rotation around bonds that interchange hydrogens (for example, the rotation of the hydrogens of a methyl group), but this does not imply any loss of accuracy as it is usual practice to exclude hydrogens when calculating the RMSD (see below).

The method used by Confab to generate conformations is known as the torsion-driving approach. A set of allowed torsion angles is assigned to each rotatable bond by searching for a match to predefined SMARTS strings in a user-configurable file (torlib.txt) included in the Confab distribution. This file is part of the Open Babel project and it assigns values to particular rotatable bonds using data from Huang et al. [15].

Once the allowed torsion angles are assigned, they are corrected for topological (that is, graph) symmetry. The presence of such symmetry allows performance to be improved by eliminating redundant evaluations, thus reducing the number of conformations that will be tested. 2-fold symmetry is identified when a rotatable bond involves an sp2-hybridised carbon atom where the neighbouring two atoms affected by the rotation are both of the same symmetry class. When this occurs the allowed values of that torsion are halved by restricting them to those less than 180°. The same is done for the case of 3-fold symmetry at an sp3-hybridised carbon where the three neighbours are of the same symmetry class; in this case the torsion angles are restricted to those less than 120°. If graph symmetry is identified at both ends of a rotatable bond, the result is multiplicative; a 2-fold and a 3-fold symmetry combine to restrict allowed values of the torsion angles to those less than 360°/6 = 60°.

The next step is to obtain an estimate of the energy of the most stable conformer. Throughout Confab, energies are calculated using the MMFF94 forcefield [16]. The values of the bond stretching, angle bend, stretch bend and out-of-plane bending terms are constant for all conformers of the same molecule; only the torsion, Van der Waals and electrostatic terms are repeatedly evaluated. A low energy conformer is found using a simple greedy algorithm. Each torsion angle is optimised starting with the most central torsion and proceeding outwards. As this procedure is relatively fast (compared to the combinatorial problem of searching for the global optimum) it is repeated up to 16 times by testing the four most central torsions in different orders. The lowest energy conformer found is used as a reference point for applying an energy cutoff during the conformer search. If, during the actual conformer generation, a lower energy conformer is found, this lower energy is used instead for the reference from that point on.

The main part of the algorithm is the systematic generation and assessment of all conformers described by the allowed torsion angles. Confab generates each of these in turn up to a user-specified cutoff (the default is 10^6) and determines its energy relative to the lowest energy conformer found so far. If this is within a user-specified energy cutoff (50 kcal/mol by default), it is assessed for diversity to the conformers already stored (see below). If it is found to be diverse, it is stored; otherwise it is discarded. The algorithm then moves on to the next conformer.

Rather than iterate in a ‘depth-first’ manner over the torsions and their allowed angles, Confab uses a Linear Feedback Shift Register (LFSR) to iterate in a random order over all of the conformers. An LFSR allows the generation of all integers from 1 to N pseudorandomly without repetition and without any memory overhead (which is important for large values of N). By iterating randomly, Confab avoids biasing generated conformers towards a particular region of conformational space, for example towards the input conformation. It also helps

Figure 1 Flowchart depicting the Confab algorithm.
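
The repetition-free pseudorandom ordering that the text attributes to an LFSR can be illustrated with a small sketch. It uses a 16-bit Galois LFSR with the widely used maximal-length tap mask 0xB400 and simply skips register values above N. The register width, the taps, the function names and the mixed-radix decoding of a conformer index into torsion angles are all assumptions made for illustration, since the text does not give Confab's own choices. With 16 bits, this sketch is limited to N below 65536, far smaller than Confab's default cutoff of 10^6.

```python
def lfsr_order(n, taps=0xB400, width=16):
    """Yield each integer in 1..n exactly once, in a scrambled order.

    Only the current register value is kept, so no list of visited
    indices is needed.
    """
    if not 1 <= n < (1 << width):
        raise ValueError("n must fit in the register")
    state = 1  # start value; the loop ends when the register returns to it
    while True:
        if state <= n:
            yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps  # Galois feedback step
        if state == 1:
            return  # full period completed


def decode(index, angle_sets):
    """Map a 1-based conformer index to one allowed angle per rotatable bond."""
    index -= 1
    angles = []
    for allowed in angle_sets:
        index, choice = divmod(index, len(allowed))
        angles.append(allowed[choice])
    return angles


# Hypothetical molecule with three rotatable bonds and their allowed angles.
angle_sets = [(60, 180, 300), (0, 90), (0, 120, 240)]
total = 3 * 2 * 3  # number of torsion-angle combinations
for k in lfsr_order(total):
    print(k, decode(k, angle_sets))
```

A depth-first enumeration would vary the first torsion through all of its values before touching the last one. Here successive indices jump around the whole index range, so a truncated run still samples combinations from across the torsion grid.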







increase diversity if the number of possible conformations is greater than the cutoff for the number tested.
   Diversity is ensured by calculating the heavy-atom RMSD (after least-squares alignment) of the newly generated conformation to those previously stored. The alignment is carried out using the QCP algorithm of Theobald [17] (which we found to be about twice as fast as the popular Kabsch alignment method [18]). Despite this, when a molecule has many conformers and a large number of conformers have been stored, full pairwise RMSD calculations take an excessive amount of time. To minimise the number of RMSD evaluations required to discard a conformer, chosen conformers are stored in a tree structure that effectively clusters conformers on-the-fly by RMSD. Figure 2(a) shows a typical ‘diversity tree’ where each level of the tree is associated with a smaller RMSD diversity from 3.0 Å down to the cutoff specified by the user (1.6 Å in the figure). Each node of the tree represents a stored conformation. Sibling nodes (that is, nodes at the same level that share the same parent node) differ by at least the RMSD diversity associated with that level. Note that sibling nodes are ordered and that the first child node of each parent is the same as the parent itself.
   To illustrate the algorithm, let us imagine adding a new conformation H to the tree depicted in Figure 2(a). The algorithm starts at the top of the tree and determines which of the two branches (A or B) to take at the 3.0 Å diversity level. To do so it checks whether H is within 3.0 Å RMSD of A. If so, it follows the tree down to the next level, and checks to see whether it is within 2.0 Å RMSD of A (note that it does not need to recalculate the RMSD to do this). If this is not true, then it checks for 2.0 Å similarity to C. If so, it follows C down to the next level; otherwise it checks against D. If it is not similar to D, H is stored in the tree as the next sibling at that level of the tree (this is depicted in Figure 2(b)). When adding a new node for a conformation at a particular level, if the level is not at the bottom then child nodes containing that conformation are added at successively lower levels until the bottom level is reached. Overall, there are two possibilities: either the algorithm reaches the bottom level and finds that the new conformation is within the RMSD cutoff of an existing conformer, in which case it is discarded, or else it is of sufficient diversity to be stored at some level of the tree.
   This algorithm greatly reduces the number of RMSD evaluations during the conformer generation loop. However, it does not eliminate all conformations that are similar to those already stored; conformations may be retained that differ by less than the RMSD cutoff if they end up in different branches. To prune the set of retained conformations, while still avoiding a computationally expensive pairwise RMSD calculation, all of the retained conformations are added one-by-one to a new tree in order of increasing energy. This time the algorithm used for adding conformations to the diversity tree is more robust: all sibling conformations are tested for similarity, even after finding one that is similar. The result is that the same conformation may be added at several different points in the tree. This makes the tree more effective at eliminating similar conformations at the expense of a greater number of RMSD calculations.
   The RMSD can be overestimated when a molecule’s structure has automorphisms (an automorphism is a permutation of the atoms of a molecule that preserves the bond connections). For example, if you consider a para-substituted phenyl ring where two conformations differ by a rotation of 180° around the substituted carbons, it is clear that the calculated RMSD between the conformations should be 0. However, if the symmetry of the phenyl ring is not taken into account this will not be the case and the RMSD will be overestimated as the corresponding atoms of the two structures have moved. The symmetry-corrected RMSD is obtained by iterating over the automorphisms of the molecule and taking the minimum value of the resulting RMSDs. For performance reasons,




 Figure 2 An example diversity tree used to filter conformations on-the-fly. (a) A diversity tree containing five conformations (A to E) used
 to filter conformations with an RMSD of less than 1.6 Å to one of the stored conformations. (b) The same diversity tree after addition of
 conformer H, where H is within 3.0 Å of A but not within 2.0 Å of A, C or D.
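
The insertion procedure described for the diversity tree can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not Confab's C++ implementation (which relies on tree.hh): the names `Node`, `DiversityTree` and `add` are hypothetical, each conformation is represented by a single number, and the absolute difference between two numbers stands in for the aligned heavy-atom RMSD. The strict `<` comparison against each level's threshold is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Node:
    idx: int                                   # index into DiversityTree.confs
    children: List["Node"] = field(default_factory=list)


class DiversityTree:
    """Toy diversity tree; `levels` are decreasing thresholds, the last being the cutoff."""

    def __init__(self, levels: Sequence[float], dist: Callable[[object, object], float]):
        self.levels = list(levels)
        self.dist = dist                       # stand-in for an aligned RMSD (assumption)
        self.confs: List[object] = []          # stored conformations
        self.root = Node(idx=-1)               # dummy root; its children form the top level
        self.n_dist_calls = 0

    def _chain(self, idx: int, depth: int) -> Node:
        # A node added above the bottom level gets copies of itself at every lower level.
        node = Node(idx)
        cur = node
        for _ in range(depth + 1, len(self.levels)):
            child = Node(idx)
            cur.children.append(child)
            cur = child
        return node

    def add(self, conf) -> bool:
        """Return True if `conf` was stored, False if it was discarded as too similar."""
        parent = self.root
        cache = {}                             # distances already computed for this conf
        for depth, threshold in enumerate(self.levels):
            match = None
            for sibling in parent.children:    # siblings are tested in order
                if sibling.idx not in cache:
                    cache[sibling.idx] = self.dist(conf, self.confs[sibling.idx])
                    self.n_dist_calls += 1
                if cache[sibling.idx] < threshold:
                    match = sibling            # stop at the first similar sibling
                    break
            if match is None:                  # diverse enough at this level: store it here
                self.confs.append(conf)
                parent.children.append(self._chain(len(self.confs) - 1, depth))
                return True
            parent = match                     # otherwise descend into the similar branch
        return False                           # similar at the bottom level: discard


tree = DiversityTree([3.0, 2.0, 1.6], dist=lambda a, b: abs(a - b))
for value in (0.0, 5.0, 2.5, 0.5, 1.8):
    print(value, "stored" if tree.add(value) else "discarded")
```

In this toy run, 1.8 is stored even though it lies within 1.6 of the previously stored 2.5, because the two end up in different branches. This is the situation that the second pass described in the text (testing all siblings, in order of increasing energy) is intended to clean up.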

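
The symmetry correction described for the para-substituted phenyl ring can be illustrated with a small sketch. The assumptions are as follows: the function names are hypothetical, the two automorphisms of the toy ring are written out by hand (in Confab they are detected by Open Babel), and no least-squares alignment is performed because the two toy conformations are already superimposed.

```python
import math
from typing import Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]


def rmsd(a: Coords, b: Coords, mapping: Sequence[int]) -> float:
    """RMSD between atom i of `a` and atom mapping[i] of `b` (no alignment step)."""
    total = 0.0
    for i, j in enumerate(mapping):
        total += sum((p - q) ** 2 for p, q in zip(a[i], b[j]))
    return math.sqrt(total / len(mapping))


def symmetry_corrected_rmsd(a: Coords, b: Coords,
                            automorphisms: Sequence[Sequence[int]]) -> float:
    """Minimum RMSD over all supplied atom permutations."""
    return min(rmsd(a, b, perm) for perm in automorphisms)


# Six ring carbons in the xy-plane; atoms 0 and 3 are the substituted (para) carbons.
ring = [(1.39, 0.0, 0.0), (0.695, 1.2038, 0.0), (-0.695, 1.2038, 0.0),
        (-1.39, 0.0, 0.0), (-0.695, -1.2038, 0.0), (0.695, -1.2038, 0.0)]
# Rotating by 180 degrees about the axis through atoms 0 and 3 maps y to -y.
flipped = [(x, -y, z) for x, y, z in ring]

identity = [0, 1, 2, 3, 4, 5]
ring_flip = [0, 5, 4, 3, 2, 1]   # swaps the two ortho atoms and the two meta atoms

naive = rmsd(ring, flipped, identity)
corrected = symmetry_corrected_rmsd(ring, flipped, [identity, ring_flip])
print(f"naive RMSD = {naive:.2f}, symmetry-corrected RMSD = {corrected:.2f}")
```

The naive value is well above zero even though the two conformations are chemically identical, whereas the minimum over the automorphisms is zero.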







the calculation of the RMSD is not symmetry-corrected during the main conformation generation loop. However, the symmetry correction is applied afterwards when building the final diversity tree, thereby eliminating any conformations that were retained in error.

Implementation
Confab is essentially a modified version of Open Babel [19], a widely-used cheminformatics toolkit written in C++ and available under the open source GPL v2 licence [20]. In fact, some of the code written for Confab has been merged into the main Open Babel distribution (such as the original Kabsch alignment code) but due to an additional dependency (on tree.hh, see below) the core code has not been included in Open Babel v2.3.
   The MMFF94 forcefield, the conformer generation framework and the automorphism detection are all provided by Open Babel. QCP alignment was implemented using Theobald’s public domain code [21] in combination with the Eigen2 high performance linear algebra library [22]. The diversity analysis code relies on a tree data structure provided by the Open Source tree.hh library [23]. The code used to implement the Linear Feedback Shift Register (LFSR) was adapted from its corresponding Wikipedia article [24]. Tap values for the register were taken from Alfke’s Xilinx application note [25].
   The Confab distribution contains two command-line applications: confab and calcrmsd. The former implements the Confab algorithm to generate conformers given an input 3D structure, while the latter may be used to assess the performance of confab by comparing the generated conformers to a file containing crystal structures. Full details of these applications are available on the Confab website.

Coverage of Conformational Space
Dataset
To illustrate the performance of Confab, we used a dataset of 1000 small molecule crystal structures derived from that of Borodina et al. [26]. The original source is the PDB; thus this dataset represents bioactive conformations of molecules. The 3D structures of the 14504 ligands in the Borodina dataset were obtained using the PubChem Download Service (using the PubChem Substance IDs from Borodina et al.). Of these, 16 could not be handled by the MMFF94 forcefield, 5202 had no rotatable bonds (this fraction included a large number of trivial salts) and 2348 had more than 1 million conformers (according to Confab’s torsion rules). 1000 structures were randomly chosen from the 6938 remaining. See Additional file 1 for the structures of these 1000 molecules.
   To avoid bias towards the crystal structures, the input conformations for Confab were generated by building the 3D structure using Open Babel. After the initial structure generation, the structures were optimised using the MMFF94 forcefield (200 steps steepest descent). Since Confab does not explore ring conformations, ring conformations were taken from the crystal structure for the initial structure generation. See Additional file 2 for the generated structures.

Results
Figure 3(b) shows an overview of the dataset of 1000 structures in terms of the number of rotatable bonds in each molecule. Although the dataset contains molecules with up to 12 rotatable bonds, it is clear by comparison with the full dataset of Borodina et al. in Figure 3(a) that the reduced dataset is only a representative sample for molecules having up to 7 rotatable bonds. Beyond this, the restriction that the molecule must have fewer than 1 million conformers leads to the elimination of most of the molecules. For this reason, to avoid erroneous conclusions some of the following analyses (where stated) will not consider molecules having 8 or more rotatable bonds.

 Figure 3 The distribution of molecules in terms of the number of rotatable bonds in (a) the dataset of Borodina et al., and (b) our dataset of 1000 molecules.

  Confab was used to exhaustively generate all low
energy conformers for each molecule in the dataset for
diversity values ranging from 0.4 Å to 3.0 Å RMSD. The
default setting of 50 kcal/mol was used as an energy
cutoff. The default value of 1 million conformers was
used as the conformer cutoff; this ensured exhaustive
coverage of conformational space (as defined by Con-
fab’s torsion rules) as structures with more conformers
were not included in the dataset (see above). Figure 4
shows the mean time for conformer generation per
molecule. This is largely independent of the diversity
level for diversity levels greater than or equal to 1.0 Å.
For values less than this, an increasing amount of time
is spent performing the pairwise RMSD calculations
against stored conformations.
  Performance of conformer generators is typically mea-
sured by the percent recovery of crystal structures with
respect to a particular RMSD cutoff (see for example
Ref [9]). This is simply the percentage of molecules
which have a generated conformer within a particular
RMSD of the crystal structure. Commonly used values
for this RMSD cutoff are 2.0, 1.5 and 1.0 Å.
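
As a minimal sketch of the percent recovery metric just defined: the function name and the five RMSD values below are invented purely for illustration (they are not data from this study), and treating "within" as less than or equal to the cutoff is an assumption.

```python
from typing import Sequence


def percent_recovery(min_rmsds: Sequence[float], cutoff: float) -> float:
    """Percentage of molecules whose closest generated conformer is within `cutoff`.

    `min_rmsds` holds one value per molecule: the minimum RMSD (in Angstrom)
    between any generated conformer and the crystal structure.
    """
    hits = sum(1 for r in min_rmsds if r <= cutoff)
    return 100.0 * hits / len(min_rmsds)


# Hypothetical minimum RMSDs for five molecules (illustrative values only).
example = [0.4, 0.9, 1.2, 1.7, 2.3]
for cutoff in (2.0, 1.5, 1.0):
    print(f"{cutoff:.1f} A cutoff: {percent_recovery(example, cutoff):.0f}% recovered")
```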
  Figure 5(a) shows the percent recovery at these cutoffs for different values of the RMSD diversity. At 2.0 Å RMSD diversity, 99% are within 2.0 Å RMSD of the crystal structure (83% within 1.5, 41% within 1.0); at 1.5 Å RMSD diversity, 99% are within 2.0 Å (97% within 1.5, 50% within 1.0); at 1.0 Å RMSD diversity, 99% are within 2.0 Å RMSD (98% within 1.5, 89% within 1.0). As expected, the percentage of crystal structures that are found decreases as the RMSD diversity increases. In particular, the curves fall off steeply once the RMSD diversity is greater than the required cutoff.

 Figure 4 Effect of diversity level on speed of conformer generation. Times were measured on an Intel Xeon E5620 Processor (2.4GHz, 4C) with 32GB RAM.

 Figure 5 Performance measured as % recovery of crystal structures. (a) Performance for different RMSD cutoffs. The diversity cutoff is where the value of the RMSD diversity is used as the RMSD cutoff. (b) The RMSD cutoff required to achieve a particular level of % recovery. The diagonal line indicates the maximum RMSD cutoff expected when there is complete coverage.

  An interesting question to ask is what RMSD diversity is required to recover X% of crystal structures with respect to a particular RMSD cutoff? Figure 5(b) shows the answer to this where X is 90%, 95% or 98%. For example, to find 95% of the crystal structures within a 2.0 Å cutoff an RMSD diversity of 2.4 Å (or smaller) is required, but to find the same percentage to within 1.5 Å an RMSD diversity of 1.6 Å is needed. However, even an RMSD diversity of 0.4 Å will not recover 98%







of the structures to within 1.0 Å (it only recovers 96%), an indication of the inherent diversity of the generated conformers as discussed further below.
  As pointed out by Borodina et al. [26], if the conformational space is perfectly covered and lacks any ‘holes’ then the RMSD diversity is an upper bound of the minimum RMSD to the crystal structure. In other words, at an RMSD diversity of 1.5 Å for example, all crystal structures should be found to within 1.5 Å. The diagonal line in Figure 5(b) indicates the maximum RMSD cutoff expected if this ideal behaviour is observed. It is clear from the figure that at low RMSD diversity the actual performance is poorer than this.
  There are two main problems that give rise to gaps in conformational coverage. The first is that the allowed torsion values may not encompass the specific torsion angle observed in the crystal structure. For this dataset, there are 7 molecules for which the crystal structure could not be found within 2.0 Å even at 0.4 Å RMSD diversity. These molecules (PubChem substance IDs of 584680, 823881, 825747, 826196, 828032, 830919 and 834618), of which two represent different conformations of the same molecule, all involve sugar moieties and it may be that the allowed torsion angles of the glycosidic bond are too conservative.
  The second is that the granularity of the allowed torsion settings may not be sufficiently fine to allow solutions to be found to within a low RMSD cutoff. For example, a carbon-carbon single bond has 12 allowed torsion values from 0 to 360° in increments of 30°. If such a bond is centrally located in a large molecule, even if the crystal structure has similar torsion angles to one of these conformers the RMSD may differ significantly.
  Based on this dataset, the inherent granularity of the Confab generated conformers is around 1.4 Å, as indicated by the “Diversity Cutoff” line in Figure 5(a) which falls off sharply as the RMSD diversity decreases below 1.4 Å. This line indicates the percent recovery at different levels of RMSD diversity when the RMSD cutoff used is the same as the diversity level. The sharp fall off below 1.4 Å is a deviation from the ideal behaviour described by Borodina et al.
  Table 1 shows the median number of generated conformers tested for molecules with different numbers of rotatable bonds. Broadly speaking, about one third of the conformers pass the energy cutoff applied. Although the size of each individual subset is not very large, and the values for 6 rotatable bonds seem to be biased towards a larger number of conformers, some general points can still be made.
  The number of diverse conformers is much reduced by a higher diversity level. For example, for those molecules with 7 rotatable bonds there are approximately 11000 low energy conformers of which about 13% are diverse at 0.5 Å RMSD, only 1.3% are diverse at 1.0 Å RMSD, and only 0.16% are diverse at 1.5 Å RMSD.
  The values in Table 1 are in broad agreement with those reported by Smellie et al. [27] for a representative subset of their dataset (see table three therein). They make the point that the number of conformers required to cover conformational space is really surprisingly low. For a molecule with 7 rotatable bonds in our dataset, conformational space can be covered to within 1.0 Å with merely hundreds of conformations while just tens of conformations will achieve a coverage of 1.5 Å. Of course, these figures are expected to increase with each additional rotatable bond.
  For completeness, Table 1 also reports median values for the minimum RMSD to the crystal structure. However, as a metric for coverage these values give a misleadingly positive picture compared to the percent recovery values discussed above.

Comparison with Multiconf-DOCK
Multiconf-DOCK [13] is another open source conformer generator that uses a torsion driving approach to implement a systematic search to identify diverse low energy

Table 1 Relationship between the number of rotatable bonds, the number of conformers generated and the minimum RMSD to the crystal structure
Rotatable  Number of  Total       Low Energy   Diverse Conformers (median)          Minimum RMSD to crystal (median)
bonds†     molecules  Conformers  Conformers
                      (median)    (median)     0.5 Å  1.0 Å  1.5 Å  2.0 Å  3.0 Å    0.5 Å  1.0 Å  1.5 Å  2.0 Å  3.0 Å
1          214        3           3            3      1      1      1      1        0.18   0.40   0.45   0.45   0.45
2          97         36          25           8      2      1      1      1        0.34   0.54   0.74   0.80   0.80
3          216        72          44           19     4      1      1      1        0.39   0.70   1.02   1.06   1.06
4          143        1296        582          96     9      2      1      1        0.52   0.80   1.07   1.14   1.24
5          86         3024        1065         189    24     4      1      1        0.60   0.82   1.14   1.31   1.34
6          114        186624      24317        2953   192    24     5      1        0.71   0.90   1.21   1.49   1.78
7          69         34992       10679        1402   139    17     4      1        0.66   0.83   1.14   1.44   1.73
† The 61 molecules with 8 or more rotatable bonds are omitted.
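
The percentages quoted in the text for molecules with 7 rotatable bonds can be reproduced from the corresponding row of Table 1. The short sketch below simply takes ratios of the tabulated medians; that the quoted figures were derived in exactly this way is an assumption, although the numbers agree. Variable names are illustrative.

```python
# Median values for molecules with 7 rotatable bonds, copied from Table 1.
total_conformers = 34992
low_energy = 10679
diverse = {0.5: 1402, 1.0: 139, 1.5: 17}   # RMSD diversity (Angstrom) -> median count

# Fraction of tested conformers that pass the energy cutoff.
pass_rate = 100.0 * low_energy / total_conformers
print(f"pass the energy cutoff: {pass_rate:.1f}%")

# Fraction of low energy conformers that are diverse at each level.
diverse_pct = {level: 100.0 * n / low_energy for level, n in diverse.items()}
for level, pct in diverse_pct.items():
    print(f"diverse at {level} A RMSD: {pct:.2f}%")
```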








conformers. This software differs in that it uses the AMBER force field [28,29] (as implemented in DOCK5) instead of MMFF94. In addition, it implements performance improvements such as search tree pruning by partial energy estimation [14]. Like Confab, the software requires a 3D structure as input.
  Multiconf-DOCK was used to generate conformations for the 1000 structures in the dataset using the same input as for Confab but converted to MOL2 using Open Babel v2.3.0. It should be noted that the specified Sybyl atom types in the input MOL2 file have an effect on the conformations generated by Multiconf-DOCK. The parameters used were taken from the example provided with the Multiconf-DOCK distribution, except that no restriction was placed on the number of generated conformations and the energy cutoff was set to 50 kcal/mol (as used for Confab). Three different RMSD diversity levels were investigated: 2.0 Å, 1.5 Å and 1.0 Å. For all three diversity levels, the mean time spent per molecule was 6.3 s (measured on the same machine used for Figure 4).
  The performance in terms of percent recovery is as follows: at 2.0 Å RMSD diversity, 99% are within 2.0 Å RMSD of the crystal structure (89% within 1.5, 55% within 1.0); at 1.5 Å RMSD diversity, 99% are within 2.0 Å (97% within 1.5, 64% within 1.0); at 1.0 Å RMSD diversity, 99% are within 2.0 Å (98% within 1.5, 80% within 1.0). These values are broadly similar to those for Confab (see above). The most noticeable differences occur for the percentage of structures found to within 1.0 Å RMSD; assuming that both programs successfully remove conformations that are within the diversity cutoff, Multiconf-DOCK outperforms Confab at the 2.0 Å and 1.5 Å RMSD diversity levels but Confab performs better at 1.0 Å RMSD diversity.
  Table 2 shows the median number of conformers generated by Multiconf-DOCK, along with the minimum RMSD to the crystal structure, broken down by the number of rotatable bonds. Compared to Confab, far fewer conformers are generated. It is difficult to say whether this represents a less comprehensive coverage of conformational space or whether this is due to the use of different forcefields. In terms of the minimum RMSD to the crystal structure, once again we see that Multiconf-DOCK performs better than Confab at the 2.0 Å and 1.5 Å RMSD diversity levels but Confab is better at 1.0 Å RMSD diversity.

Distance Distribution in Conformations of a Phenyl Sulfone
Many conformer generators are focused on reproducing bioactive conformations. However, it is worth remembering that the generation of conformers may also be useful in other contexts. Here we use Confab as an aid to interpreting the NMR spectra for the phenyl sulfone shown in Figure 6. The peak for the methylene carbon of the ethyl ester was split unexpectedly (compared to an analogous sulfone where the phenyl group was replaced by tert-butyl), and our hypothesis was that this was due to the close approach of the methylene carbon to one of the sulfonyl oxygens in solution. Confab was used to investigate whether low energy conformations existed where the methylene group was in close proximity to a sulfonyl oxygen.
  Confab was used to generate a set of conformations of the molecule with a diversity of 0.2 Å and no energy cutoff. The resulting 2014 conformations were optimised using the MMFF94 forcefield (200 steps steepest descent; implemented using Pybel [30]) and the final energy recorded. For each of the conformations the minimum distance between a sulfonyl oxygen and the methylene carbon was measured.
  Figure 7 shows a plot of these distances versus the relative energies of the conformers with marginal histograms showing the distribution of values. The methylene carbon does not approach the sulfonyl group very closely. For low energy conformers, the distances are clustered around 4.0 Å and 5.4 Å with the former more frequent. Taking 5 kcal/mol as a cutoff, the distance can be as low as 3.7 Å but shorter distances (down to 3.0 Å)


Table 2 Results for Multiconf-DOCK showing the relationship between the number of rotatable bonds, the number of
conformers generated and the minimum RMSD to the crystal structure
Rotatable bonds            Diverse Conformers (median)                  Minimum RMSD to crystal (median)
                           1.0 Å             1.5 Å       2.0 Å          1.0 Å             1.5 Å             2.0 Å
1                          1                 1           1              0.34              0.40              0.40
2                          3                 1           1              0.50              0.67              0.71
3                          2                 1           1              0.68              0.78              0.81
4                          9                 3           1              0.76              0.97              1.05
5                          14                4           2              0.85              1.03              1.28
6                          43                15          5              1.08              1.23              1.37
7                          21                8           3              1.04              1.24              1.40








are only possible with an associated energy penalty. Figure 6 shows one of the low energy conformations (relative energy of 4.6 kcal/mol) which has a distance of 3.7 Å between the groups of interest.

 Figure 6 Structure of the phenyl sulfone studied.

 Figure 7 Scatterplot with marginal histograms of distance versus energy for the set of conformations of the phenyl sulfone in Figure 6.

Conclusion
The goal of this first release of Confab is to ensure complete coverage of all of the low energy conformers of a molecule. While every effort is made to maximise performance, accuracy has been the main goal. Approximations that reduce the search space on the basis of heuristics have been avoided for this reason.
  Using the results from Confab 1.0 as a comparison, future work will investigate strategies to overcome the combinatorial explosion associated with large numbers of rotatable bonds [31] including the trade-off between speed and accuracy.

Availability and Requirements
Project name: Confab
  Project home page: http://confab.googlecode.com
  Operating system(s): Cross-platform
  Programming language: C++
  Other requirements (if compiling): CMake 2.4+, Eigen2
  Licence: GPL v2
  Any restrictions to use by non-academics: None

Additional material
 Additional file 1: Crystal structures used to test conformational coverage. This is a text file in SDF format containing biological conformations (as downloaded from PubChem) of 1000 molecules. This is a subset of the data used in the study by Borodina et al.
 Additional file 2: Generated 3D structures used to test conformational coverage. This is a text file in SDF format containing 3D structures of the 1000 molecules in the dataset generated using Open Babel. These were used as the input to Confab.

Acknowledgements and Funding
NMOB is supported by a Health Research Board Career Development Fellowship, PD/2009/13. We thank several beta testers for their valuable feedback, and the anonymous reviewers for their constructive comments.

Author details
1 Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland. 2 Open Babel development team. 3 Department of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA.

Authors’ contributions
NMOB devised and implemented Confab, and carried out the coverage analysis. GRH implemented the conformer generation framework in Open Babel and contributed to the forcefield code. TV implemented the automorphism code in Open Babel and contributed to the forcefield code. NMOB collaborated with CJF and ARM on the sulfone investigation. All authors read and approved the final manuscript.

Received: 9 February 2011 Accepted: 16 March 2011
Published: 16 March 2011

References
1.      Schwab CH: Conformations and 3D pharmacophore searching. Drug
        Discov Today Tech 2010, 7:e245-e253.
2.      Sadowski J, Gasteiger J, Klebe G: Comparison of Automatic Three-
        Dimensional Model Builders Using 639 X-Ray Structures. J Chem Inf
        Comput Sci 1994, 34(4):1000-1008.
3.      Lagorce D, Pencheva T, Villoutreix BO, Miteva MA: DG-AMMOS: A New tool
        to generate 3D conformation of small molecules using Distance
        Geometry and Automated Molecular Mechanics Optimization for in
        silico Screening. BMC Chem Biol 2009, 9:6.








4.      Gilbert K, Guha R: smi23d. [http://www.chembiogrid.org/cheminfo/smi23d/].
5.      Hawkins PCD, Skillman AG, Warren GL, Ellingson BA, Stahl MT: Conformer
        Generation with OMEGA: Algorithm and Validation using High Quality
        Structures from the Protein Databank and Cambridge Structural
        Database. J Chem Inf Model 2010, 50:572-584.
6.      Renner S, Schwab CH, Gasteiger J, Schneider G: Impact of Conformational
        Flexibility on Three-Dimensional Similarity Searching Using Correlation
        Vectors. J Chem Inf Model 2006, 46:2324-2332.
7.      Catalyst. Accelrys Inc: San Diego, CA; [http://accelrys.com/].
8.      Confort. Tripos Inc: St Louis, MO; [http://www.tripos.com/].
9.      Watts KS, Dalal P, Murphy RB, Sherman W, Friesner RA, Shelley JC: ConfGen:
        A Conformational Search Method for Efficient Generation of Bioactive
        Conformers. J Chem Inf Model 2010, 50:534-546.
10.     Vainio MJ, Johnson MS: Generating Conformer Ensembles Using a
        Multiobjective Genetic Algorithm. J Chem Inf Model 2007, 47:2462-2474.
11.     Sperandio O, Souaille M, Delfaud F, Miteva MA, Villoutreix BO: MED-3DMC:
        A new tool to generate 3D conformation ensembles of small molecules
        with a Monte Carlo sampling of the conformational space. Eur J Med
        Chem 2009, 44:1405-1409.
12.     Miteva MA, Guyon F, Tufféry P: Frog2: Efficient 3D conformation
        ensemble generator for small compounds. Nucleic Acids Res 2010, 38:
        W622-W627.
13.     Sauton N, Lagorce D, Villoutreix BO, Miteva MA: MS-DOCK: Accurate
        multiple conformation generator and rigid docking protocol for multi-
        step virtual ligand screening. BMC Bioinformatics 2008, 9:184.
14.     Makino S, Kuntz ID: Automated flexible ligand docking method and its
        application for database search. J Comput Chem 1997, 18:1812-1825.
15.     Huang N, Shoichet BK, Irwin JJ: Benchmarking Sets for Molecular Docking.
        J Med Chem 2006, 49:6789-6801.
16.     Halgren TA: Merck molecular force field. I. Basis, form, scope,
        parameterization, and performance of MMFF94. J Comp Chem 1996,
        17:490-519.
17.     Theobald DL: Rapid calculation of RMSDs using a quaternion-based
        characteristic polynomial. Acta Cryst A 2005, 61:478-480.
18.     Kabsch W: A solution for the best rotation to relate two sets of vectors.
        Acta Cryst A 1976, 32:922-923.
19.     Hutchison GR, Morley C, Vandermeersch T, O’Boyle NM, James C, et al:
        Open Babel, v2.3. [http://openbabel.org].
20.     Free Software Foundation: GNU General Public License, v2.
        [http://www.gnu.org/licenses/old-licenses/gpl-2.0.html].
21.     Theobald DL: QCProt, v1.1. [http://theobald.brandeis.edu/qcp/].
22.     Guennebaud G, Jacob B, et al: Eigen, v2.0.15. [http://eigen.tuxfamily.org].
23.     Peeters K: tree.hh, v2.65. [http://tree.phi-sci.com/].
24.     Linear feedback shift register. Wikipedia
        [http://en.wikipedia.org/wiki/Linear_feedback_shift_register],
        Retrieved Aug 11, 2010.
25.     Alfke P: Efficient Shift Registers, LFSR Counters, and Long
        Pseudo-Random Sequence Generators. Xilinx application note 1996
        [http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf],
        XAPP052.
26.     Borodina YV, Bolton E, Fontaine F, Bryant SH: Assessment of
        Conformational Ensemble Sizes Necessary for Specific Resolutions of
        Coverage of Conformational Space. J Chem Inf Model 2007, 47:1428-1437.
27.     Smellie A, Kahn SD, Teig SL: Analysis of Conformational Coverage. 1.
        Validation and Estimation of Coverage. J Chem Inf Comput Sci 1995,
        35(2):285-294.
28.     Weiner SJ, Kollman PA, Case DA, Singh UC, Ghio C, Alagona G, Profeta S Jr,
        Weiner P: A new force field for molecular mechanical simulation of
        nucleic acids and proteins. J Am Chem Soc 1984, 106:765-784.
29.     Weiner SJ, Kollman PA, Nguyen DT, Case DA: An all atom force field for
        simulations of proteins and nucleic acids. J Comput Chem 1986,
        7:230-252.
30.     O’Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the
        OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
31.     Beusen DD, Shands EFB, Karasek SF, Marshall GR, Dammkoehler RA:
        Systematic search in conformational analysis. J Mol Struct THEOCHEM
        1996, 370:157-171.

     doi:10.1186/1758-2946-3-8
     Cite this article as: O’Boyle et al.: Confab - Systematic generation of
     diverse low-energy conformers. Journal of Cheminformatics 2011 3:8.




O’Boyle Journal of Cheminformatics 2011, 3:10
   http://www.jcheminf.com/content/3/1/10




    BOOK REPORT                                                                                                                                    Open Access

   Review of “Data Analysis with Open Source
   Tools” by Philipp K Janert
   Noel M O’Boyle

   Correspondence: baoilleach@gmail.com
   Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland

   © 2011 O’Boyle; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

   Book details
   Janert PK: Data Analysis with Open Source Tools. Sebastopol, CA: O’Reilly Media; 2010.

   Cheminformatics has been defined as the application of informatics methods to solve chemical problems [1]. Such chemical problems are often represented in terms of data, be it activity data for a series of compounds or descriptor values for a compound library. While this new book from the O’Reilly stable is not aimed specifically at cheminformaticians, the subtitle of “A Hands-On Guide for Programmers and Data Scientists” makes it clear that the target audience includes any scientists whose day-to-day work involves analysing and interpreting data.

   The book is broadly divided into four parts on Graphics: Looking at Data, Analytics: Modeling Data, Computation: Mining Data and Applications: Using Data. First of all, it should be noted that this is not a book about statistics (as Chapter 1 states explicitly). Neither is it a manual for numpy, Sage, matplotlib, Gnuplot, R and so forth, as might be implied by the title. Instead, Janert focuses on discussing data analysis methods and techniques in depth, rather than skimming topics by following a cookbook or tutorial approach linked to particular software. This is as it should be - there are already documentation and manuals available for all of these programs, and the reader is simply alerted to the availability of the software, its capabilities are described and some examples of use shown.

   This is a real practitioner’s book. Janert, a former physicist and software engineer, is a consultant in data analysis and mathematical modelling. He has taken his hard-won knowledge and tried to get it all down on paper for the reader’s benefit. For example, in a chapter with the provocative title of “What you really need to know about classical statistics” he explains why introductory statistics textbooks seem to cover methods and topics at odds with the problems data analysts deal with day-to-day; essentially classical methods were developed at a time of small and expensive datasets and no computational power, and hypothesis testing focused on determining whether an effect existed. Today we have ample computing power and may be dealing with very large datasets; also, we are usually more interested in the size of an effect (practical significance) rather than just whether it exists (statistical significance).

   Topics that could not be squeezed into a chapter proper have been placed in shorter “Intermezzos” at the end of each section. For example, a short section on “What about map/reduce?” at the end of “Mining Data” reminds the reader that the map/reduce methodology (much hyped recently) is not a clever algorithm to speed things up, but rather a piece of infrastructure that makes it convenient to implement algorithms that are trivially parallelisable.

   On the negative side, any cheminformatician who has been involved with QSAR studies will already be familiar with the multivariate analysis methods discussed here (Chapters 13 and 14), although I liked the observation that “you will actually spend more time on data sets that are totally worthless” in relation to clustering algorithms. Also there are two chapters (out of 19) which will be of little interest as they focus on business intelligence and financial calculations, although even there the reader will find an introduction to the use of Berkeley DB and SQLite from Python, tools which I highly recommend. There are also cases where the author perhaps gives too much detail, but this is hardly a criticism - in a book of some 500 pages there is plenty of room.

   Overall though, I heartily recommend this book to anyone working in cheminformatics whether they develop methods or apply them. Too often we rely on summary statistics such as mean and standard deviation and forget to actually look at the data. Graphical analysis gives you a feel for the data, and can often highlight problems, interesting features, or mistaken assumptions. After reading this book, you should be very aware of both the advantages and pitfalls of a wide variety of analysis methods but you will also be reminded that the goal of data analysis is not a picture or a number but insight.
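
The point about summary statistics can be illustrated with a minimal sketch. The two samples below are hypothetical values constructed for this illustration (they are not taken from the book), and the helper names `summarise` and `text_histogram` are likewise made up here. Both samples share the same mean and population standard deviation, yet a crude text histogram shows that one consists of two separate clusters while the other is a tight core with two outliers.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical samples, constructed so that their summary statistics coincide.
two_clusters = [1, 1, 1, 1, 9, 9, 9, 9]
core_plus_outliers = [-3, 5, 5, 5, 5, 5, 5, 13]


def summarise(values):
    """Return (mean, population standard deviation) of the sample."""
    return mean(values), pstdev(values)


def text_histogram(values):
    """Return one line per distinct value, with one '#' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} {'#' * counts[value]}" for value in sorted(counts)]


if __name__ == "__main__":
    for name, sample in [("two_clusters", two_clusters),
                         ("core_plus_outliers", core_plus_outliers)]:
        sample_mean, sample_sd = summarise(sample)
        print(f"{name}: mean={sample_mean}, stdev={sample_sd}")
        print("\n".join(text_histogram(sample)))
        print()
```

Only the histograms distinguish the two samples; the printed mean and standard deviation are identical, which is the kind of trap that looking at the data first is meant to avoid.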


Competing interests
The author declares that they have no competing interests.

Received: 8 March 2011 Accepted: 24 March 2011
Published: 24 March 2011

Reference
1. Gasteiger J: Introduction. In Chemoinformatics - A Textbook. Edited by:
    Gasteiger J, Engel T. Weinheim: Wiley-VCH; 2003:1-13.

 doi:10.1186/1758-2946-3-10
 Cite this article as: O’Boyle: Review of “Data Analysis with Open Source
 Tools” by Philipp K Janert. Journal of Cheminformatics 2011 3:10.




O’Boyle et al. Journal of Cheminformatics 2011, 3:37
   http://www.jcheminf.com/content/3/1/37




    RESEARCH ARTICLE                                                                                                                              Open Access

   Open Data, Open Source and Open Standards in
   chemistry: The Blue Obelisk five years on
   Noel M O’Boyle1*, Rajarshi Guha2, Egon L Willighagen3, Samuel E Adams4, Jonathan Alvarsson5,
   Jean-Claude Bradley6, Igor V Filippov7, Robert M Hanson8, Marcus D Hanwell9, Geoffrey R Hutchison10,
   Craig A James11, Nina Jeliazkova12, Andrew SID Lang13, Karol M Langner14, David C Lonie15, Daniel M Lowe4,
   Jérôme Pansanel16, Dmitry Pavlov17, Ola Spjuth5, Christoph Steinbeck18, Adam L Tenderholt19, Kevin J Theisen20
   and Peter Murray-Rust4


     Abstract
     Background: The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open
     Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by
     promoting interoperability between chemistry software, encouraging cooperation between Open Source
     developers, and developing community resources and Open Standards.
     Results: This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys
     progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.
     Conclusions: We show that the Blue Obelisk has been very successful in bringing together researchers and
     developers with common interests in ODOSOS, leading to development of many useful resources freely available
     to the chemistry community.


   Background
   The Blue Obelisk movement was established in 2005 at the 229th National Meeting of the American Chemical Society as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. While other scientific disciplines such as physics, biology and astronomy (to name a few) were embracing new ways of doing science and reaping the benefits of community efforts, there was little if any innovation in the field of chemistry and scientific progress was actively hampered by the lack of access to data and tools. Since 2005 it has become evident that a good amount of development in open chemical information is driven by the demands of neighbouring scientific fields. In many areas in biology, for example, the importance of small molecules and their interactions and reactions in biological systems has been realised. In fact, one of the first free and open databases and ontologies of small molecules was created as a resource about chemical structure and nomenclature by biologists [1].

   The formation of the Blue Obelisk group is somewhat unusual in that it is not a funded network, nor does it follow the industry consortium model. Rather it is a grassroots organisation, catalysed by an initial core of interested scientists, but with membership open to all who share one or more of the goals of the group:

     • Open Data in Chemistry. One can obtain all scientific data in the public domain when wanted and reuse it for whatever purpose.
     • Open Standards in Chemistry. One can find visible community mechanisms for protocols and communicating information. The mechanisms for creating and maintaining these standards cover a wide spectrum of human organisations, including various degrees of consent.
     • Open Source in Chemistry. One can use other people’s code without further permission, including changing it for one’s own use and distributing it again.

   * Correspondence: baoilleach@gmail.com
   1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, College Road, Cork, Co. Cork, Ireland
   Full list of author information is available at the end of the article

   © 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.








Note that while some may advocate also for Open Access to publications, the Blue Obelisk goals (ODOSOS) focus more on the availability of the underlying scientific data, standards (to exchange data), and code (to reproduce results). All three of these goals stem from the fundamental tenets of the scientific method for data sharing and reproducibility.

The Blue Obelisk was first described in the CDK News [2] and later as a formal paper by Guha et al. [3] in 2006. Its home on the web is at http://blueobelisk.org. This contribution looks back on the work carried out by the Blue Obelisk over the past 5 years in the areas of Open Data, Open Source, and Open Standards in chemistry.

Scope
The Blue Obelisk covers many areas of chemistry and chemical resources used by neighbouring disciplines (e.g. biochemistry, materials science). Many of the efforts relate to cheminformatics (the scope of this journal) and we believe that many of the publications in Journal of Cheminformatics could be completely carried out using Blue Obelisk resources and other Open Source chemical tools. The importance of this is that for the first time it would allow reviewers, editors and readers to validate assertions in the journal and also to re-run and re-analyse parts of the calculation.

However, Blue Obelisk software and data is also used outside cheminformatics and certainly in the five main areas that, for example, Chemical Markup Language (CML) [4] supports:

    1. Molecules: This is probably the largest area for Blue Obelisk software and data, and is reflected by many programs that visualise, transform, convert formats and calculate properties. It is almost certain that any file format currently in use can be processed by Blue Obelisk software and that properties can be calculated for most (organic compounds).
    2. Reactions: Blue Obelisk software can describe the semantics of reactions and provide atom-atom matching and analyse stoichiometric balance in reactions.
    3. Computational chemistry: Blue Obelisk software can interpret many of the current output files from calculations and create input for jobs. The Quixote project (see below and elsewhere in this issue) shows that Open Source approaches based on Blue Obelisk resources and principles are increasing the availability and re-usability of computational chemistry.
    4. Spectra: 1-D spectra (NMR, IR, UV etc.) are fully supported in Blue Obelisk offerings for conversion and display. There is a limited amount of spectral analysis but the software gives a platform on which it should be straightforward to develop spectral annotation and manipulation. However, currently the Blue Obelisk lacks support for multi-dimensional NMR and multi-equipment spectra (e.g. GC-MS).
    5. Crystallography: The Blue Obelisk software supports the bi-directional processing of crystal structure files (CIF) and also solid-state calculations such as plane-waves with periodic boundary conditions. There is considerable support for the visualisation of both periodic and aperiodic condensed objects.

Many of the current operations in installing and running chemical computations and using the data are integration and customisation rather than fundamental algorithms. It is very difficult to create universal platforms that can be distributed and run by a wide range of different users, and in general, the Blue Obelisk deliberately does not address these. Our approach is to produce components that can be embedded in many environments, from stand-alone applications to web applications, databases and workflows. We believe that a chemical laboratory with reasonable access to common software engineering techniques should be able to build customised applications using Blue Obelisk components and standard infrastructure such as workflows and databases. Where the Blue Obelisk itself produces data resources they are normally done with Open components so that the community can, if necessary, replicate them.

Much of the impetus behind Blue Obelisk software is to create an environment for chemical computation (including cheminformatics) where all of the components, data, specifications, semantics, ontology and software are Openly visible and discussable. The largest current uses by the general chemical community are in authoring, visualisation and cheminformatics calculations but we anticipate that this will shortly extend into mainstream computational chemistry and solid-state. Although many of the authors are employed as research scientists, there are also several people who contribute in their spare time and we anticipate an increasing value and use of the Blue Obelisk in education at all levels.

Open Source
The development of Open Source software has been one of the most successful of the Blue Obelisk’s activities. The following sections describe recent work in this area, and Table 1 provides an overview of the projects discussed and where to find them online.

Cheminformatics toolkits
Open Source toolkits for cheminformatics have now existed for nearly ten years. During this period, some toolkits were developed from scratch in academia,







Table 1 Blue Obelisk Open Source Software projects discussed in the text

Name                                  Website

CML Tools
  CMLXOM                              https://bitbucket.org/wwmm/cmlxom/
  JUMBO                               http://sourceforge.net/projects/cml/
Cheminformatics Toolkits
  Chemistry Development Kit (CDK)     http://cdk.sf.net
  Cinfony                             http://cinfony.googlecode.com
  Indigo                              http://ggasoftware.com/opensource/indigo
  JOELib                              http://sf.net/projects/joelib
  Open Babel                          http://openbabel.org
  RDKit                               http://rdkit.org
Web Applications
  ChemDoodle Web Components           http://web.chemdoodle.com
  Jmol                                http://jmol.org
Integration
  Bioclipse                           http://www.bioclipse.net
  CDK-Taverna                         http://cdktaverna.wordpress.com
  Lensfield2                          https://bitbucket.org/sea36/lensfield2/
Interconversion
  CIFXOM [95]                         https://bitbucket.org/wwmm/cifxom/
  JUMBO-Converters                    https://bitbucket.org/wwmm/jumbo-converters/
  OPSIN                               http://opsin.ch.cam.ac.uk
  OSRA                                http://osra.sf.net
Structure Databases
  Bingo                               http://ggasoftware.com/opensource/bingo
  Chempound (Chem#)                   https://bitbucket.org/chempound
  Mychem                              http://mychem.sf.net
  OrChem                              http://orchem.sf.net
  pgchem                              http://pgfoundry.org/projects/pgchem/
Text mining
  ChemicalTagger [96]                 http://chemicaltagger.ch.cam.ac.uk/
  OSCAR4                              https://bitbucket.org/wwmm/oscar4/
Computational Chemistry
  Avogadro                            http://avogadro.openmolecules.net
  cclib                               http://cclib.sf.net
  GaussSum                            http://gausssum.sf.net
  QMForge                             http://qmforge.sf.net
Computational Drug Design
  Confab [97]                         http://confab.googlecode.com
  Pharao                              http://silicos.be/download
  Piramid                             http://silicos.be/download
  Sieve                               http://silicos.be/download
  Stripper                            http://silicos.be/download
Other Applications
  AMBIT2                              http://ambit.sf.net
  Brunn                               http://brunn.sf.net
  Toxtree                             http://toxtree.sf.net
  XtalOpt                             http://xtalopt.openmolecules.net




J. Cheminf. 2011, 3, 37.
O’Boyle et al. Journal of Cheminformatics 2011, 3:37                                                        Page 4 of 15
http://www.jcheminf.com/content/3/1/37




whereas others were made Open Source by releasing in-       Second-generation tools
house codebases under liberal licenses. When the Blue       Although feature-rich and robust cheminformatics
Obelisk was established five years ago, the primary         toolkits are useful in and of themselves, they can also be
toolkits under active development were the Chemistry        seen as providing a base layer on which additional tools
Development Kit (CDK) [5,6], Open Babel [7], and JOE-       and applications can be built. This is one of the reasons
Lib [8]. Of these, both the CDK and Open Babel con-         that cheminformatics toolkits are so important to the
tinue to be actively developed.                             open source ‘ecosystem’; their availability lowers the bar-
   The CDK project has been under regular development       rier for the development of a ‘second generation’ of
over the last five years. Several features have been        chemistry software that no longer needs to concern
implemented ranging from core components such as an         itself with the low-level details of manipulating chemical
extensible SMARTS matching system and a new graph           structures, and can focus on providing additional func-
(and subgraph) isomorphism method [9], to more appli-       tionality and ease-of-use. Although a wide range of
cation oriented components such as 3D pharmacophore         chemistry software has been built using Blue Obelisk
searching and matching, and a variety of structural-key     components (see for example, the “Related Software”
and hashed fingerprints. In addition, there have been a     link on the Open Babel website, [13] listing over 40 pro-
number of second generation tools developed on top of       jects as of this writing, or “Software using CDK” at the
the CDK (see below). As well as the use of the CDK in       CDK website), in this section we focus on second-gen-
various tools, it has been deployed in the form of web      eration tools which themselves have been developed by
services [10] and has formed the basis of a variety of      members of the Blue Obelisk.
web applications.                                              Bioclipse [14] (v2.4 released in Aug 2010) and Avoga-
   Since 2006, major new features of Open Babel include     dro [15] (v1.0 in Oct 2009) are two examples of such
3D structure generation and 2D structure-diagram gen-       software, based on the CDK and Open Babel, respec-
eration, UFF and MMFF94 forcefields, and significantly      tively. Bioclipse (Figure 1) is an award-winning molecu-
expanded support for computational chemistry calcula-       lar workbench for life sciences that wraps
tions. In addition, a major focus of Open Babel develop-    cheminformatics functionality behind user-friendly inter-
ment has been to provide for accurate conversion and        faces and graphical editors while Avogadro (Figure 2) is
representation in areas of stereochemistry, kekulisation,   a 3D molecular editor and viewer aimed at preparing
and canonicalisation. The project has also grown, in        and analysing computational chemistry calculations.
terms of new contributors, new support from commer-         Both projects are designed to be extended or scripted by
cial companies, and second-generation tools applying        users through the provision of a plugin architecture and
Open Babel to a variety of end-user applications, from      scripting support (using Bioclipse Scripting Language
molecular editors to chemical database systems.             [16], or Python in the case of Avogadro). An interesting
   Two new Open Source cheminformatics toolkits have        aspect of both Avogadro and Bioclipse is that they share
appeared since the original paper. In 2006 Rational Dis-    some developers with the underlying toolkits and this
covery, a cheminformatics service company (since closed     has driven the development of new features in the CDK
down), released RDKit [11] under the BSD License. This      and Open Babel.
is a C++ library with Python and (more recently) Java          Both products in turn act as extensible platforms for
bindings. RDKit is actively developed and includes code     other software. Bioclipse, for example, is used by soft-
donated by Novartis. Recent developments include the        ware such as Brunn [17], a laboratory information sys-
Java bindings, as well as performance improvements for      tem for microplate based high-throughput screening.
its database cartridge.                                     Brunn provides a graphical interface for handling differ-
   More recently, GGA Software Services (a contract         ent plate layouts and dilution series and can automati-
programming company) released the Indigo toolkit [12]       cally generate dose response curves and calculate IC50-
and associated software in 2009 under the GPL. Indigo       values. Avogadro is used by Kalzium [18], a periodic
is a C++ library with high-level wrappers in C, Java,       table and chemical editor in KDE, and XtalOpt [19,20],
Python, and the .NET environment. Like RDKit and            an evolutionary algorithm for crystal structure predic-
other toolkits, Indigo provides support for tetrahedral     tion. XtalOpt provides a graphical interface using Avo-
and cis-trans stereochemistry, 2D coordinate generation,    gadro and submits calculations using a range of solid-
exact/substructure/SMARTS matching, fingerprint gen-        state simulation software to predict stable polymorphs.
eration, and canonical SMILES computation. It also pro-        A final example of second-generation Blue Obelisk
vides some less common functionality, like matching         software is the AMBIT2 [21,22] software, which was
tautomers and resonance substructures, enumeration of       developed to facilitate registration of chemicals for the
subgraphs, finding maximum common substructure of           REACH EU directive, and is based on the CDK. It was
N input structures, and enumerating reaction products.      distributed initially as a standalone Java Swing GUI, and







 Figure 1 Screenshot of Bioclipse using Jmol to visualise a molecular surface.


more recently as a downloadable web application archive,              predictive models, including modules of the open source
offering a web services interface to a searchable chemi-              Toxtree [22-24] software for toxicity prediction.
cal structures database. Also integrated are descriptor
calculations, as well as the ability to run and build                 Computational chemistry analysis
                                                                      Another area where the Blue Obelisk has had a signifi-
                                                                      cant impact in the past five years is in supporting quan-
                                                                      tum chemistry calculations and in interpreting their
                                                                      results. Electronic structure calculations have a long tra-
                                                                      dition in the chemistry community and a variety of pro-
                                                                      grams exist, mostly proprietary software but with an
                                                                      increasing number of open source codes. However,
                                                                      since each program uses different input formats, and the
                                                                      output formats vary widely (sometimes even varying
                                                                      between different versions of the same software), prepar-
                                                                      ing calculations and automatically extracting the results
                                                                      is problematic.
                                                                         Avogadro has already been mentioned as a GUI for
                                                                      preparing calculations. It uses Open Babel to read the
                                                                      output of several electronic structure packages. Avoga-
                                                                      dro generates input files on the fly in response to user
                                                                      input on forms, as well as allowing inline editing of the
 Figure 2 Screenshot of Avogadro showing a depiction of a
 carbon nanotube.                                                     files before they are saved to disk. It also features intui-
                                                                      tive syntax highlighting for GAMESS input files,








allowing expert users to easily spot mistakes before sav-      Quixote will advertise the value of Open community
ing an input file to disk.                                     standards for semantics to the world.
   In addition to this, significant development of new           The Quixote project is not dependent on any particu-
parsing routines took place in an Avogadro plugin to           lar technology, other than the representation of compu-
read in basis sets and electronic structure output in          tational chemistry in CML and the management of
order to calculate molecular orbital and electron density      semantics through CML dictionaries. At present, we use
grids. This code was written to be parallel, using desk-       JUMBO-Converters [29] for most of the semantic con-
top shared memory parallelism and high level APIs in           version, Lensfield2 [30] for the workflow and Chem-
order to significantly speed up analysis. Most of this         pound (chem#) [31] to store and disseminate the results.
code was recently separated from the plugin, and
released as a BSD licensed library, OpenQube, which is         Web applications
now used by the latest version of Avogadro. Jmol (see          While desktop software has composed the majority of
below) can also depict computational chemistry results         scientific tools since the computer was introduced, the
including molecular orbitals.                                  internet continues to change how applications and con-
   In 2006, the Blue Obelisk project cclib [25] was estab-     tent are distributed and presented. The web presents
lished with the goal of parsing the output from compu-         new opportunities for scientists as it is an open and free
tational chemistry programs and presenting it in a             medium to distribute scientific knowledge, ideas and
standard way so that further analyses could be carried         education. Web applications are software that runs
out independently of the quantum package used. cclib is        within the browser, typically implemented in Java or
a Python library, and the current version (version 1.0.1)      JavaScript. Recently, a new version of the HTML specifi-
supports 8 different computational chemistry codes and         cation, HTML5, defined a well-developed framework for
extracts over 30 different calculated attributes. Two          creating native web applications in JavaScript and this
related Blue Obelisk projects build upon cclib. Gauss-         opens up new possibilities for visualising chemical data.
Sum [26] is a GUI that can monitor the progress of               Jmol, the interactive 3D molecular viewer, is one of
SCF and geometry convergences, and can plot predicted          the most widely used chemistry applets, and indeed has
UV/Vis absorption and infrared spectra from appropri-          seen widespread use in other fields such as biology and
ate logfiles containing energies and oscillator strengths      even mathematics (it is used for 3D depiction of mathe-
for easy comparison to experimental data. QMForge              matical functions in the Sage Mathematics Projects
[27] provides a GUI for various electronic structure ana-      [32]). It is implemented in Java, and has gone from
lyses such as Frenking’s charge decomposition analysis         being a “Rasmol/Chime” replacement to a fully fledged
[28] and Mulliken or C-squared analyses on user-               molecular visualisation package, including full support
defined molecular fragments. QMForge also provides a           for crystallography [33], display of molecular orbitals
rudimentary Cartesian coordinate editor allowing mole-         from standard basis set/coefficient data, the inclusion of
cular structures to be saved via Open Babel.                   dynamic minimisation using the UFF force field, and a
   The Quixote project epitomises the full use of the          full implementation of Daylight SMILES and SMARTS,
Blue Obelisk software and is described in detail in            with extensions to conformational and biomolecular
another article in this issue. Here we observe that it is      substructure searching (Jmol BioSMARTS).
possible to convert legacy chemistry file formats of all         In 2009, iChemLabs released the ChemDoodle Web
sorts into semantic chemistry and extract those parts          Components library [34] under the GPL v3 license (with
which are suitable for input to computational chemistry        a liberal HTML exception). This library is completely
programs. This chemistry is then combined with generic         implemented in JavaScript and uses HTML5 to allow
concepts of computational chemistry (e.g. strategy,            the scientist to present publication quality 2D and 3D
machine resources, timing, accuracy etc.) into the legacy      graphics (see Figure 3) and animations for chemical
inputs for a wide range of programs. Quixote itself fol-       structures, reactions and spectra. Beyond graphics, this
lows Blue Obelisk principles in that it does not manage        tool provides a framework for user interaction to create
the submission and monitoring of jobs but resumes              dynamic applications through web browsers, desktop
action when the jobs have been completed, and then             platforms and mobile devices such as the iPhone, iPad
applies a range of parsing and transformation tools to         and Android devices.
create standardised semantic chemical content. A major
feature of Quixote is that it requires all concepts to vali-   The business end
date against dictionaries and the process of parsing files     Open Source provides a unique opportunity for com-
necessarily generates communally-agreed dictionaries,          mercial organisations to work with the cheminformatics
which represent an important step forward in the Open          community. Traditional business models rely on moneti-
specifications for Blue Obelisk. When widely-deployed,         sation of source code, causing companies to repeat work
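
As an illustration of how the energies and oscillator strengths mentioned for GaussSum can be turned into a continuous curve for comparison with experiment, the following Python sketch applies Gaussian broadening to a stick spectrum. This is one common convention and not necessarily the formula GaussSum itself uses; the transition data, the width `sigma` and the function name `broadened_spectrum` are assumptions made for this example.

```python
import math


def broadened_spectrum(transitions, grid, sigma):
    """Sum one Gaussian per transition, evaluated at each grid point.

    transitions: list of (energy, oscillator_strength) pairs
    grid:        energies at which the curve is evaluated (same unit)
    sigma:       Gaussian standard deviation (same unit)
    """
    curve = []
    for x in grid:
        total = 0.0
        for energy, strength in transitions:
            # Unnormalised Gaussian: the peak height equals the oscillator strength.
            total += strength * math.exp(-((x - energy) ** 2) / (2 * sigma ** 2))
        curve.append(total)
    return curve


# Made-up transitions in wavenumbers (cm^-1) with oscillator strengths.
transitions = [(30000.0, 0.10), (35000.0, 0.40)]
grid = [25000.0 + 500.0 * i for i in range(31)]  # 25000 to 40000 cm^-1
curve = broadened_spectrum(transitions, grid, sigma=1500.0)

# Position of the strongest band on the grid.
peak_position = grid[curve.index(max(curve))]
print(peak_position)
```

In practice the (energy, strength) pairs would come from a parsed logfile rather than being typed in, and the width would be chosen to match the experimental band shape.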







                                                             exchange. This way, licensing issues are becoming a
                                                             marginal problem, allowing companies to select a license
                                                             appropriate for their business model. This, too, allows a
                                                             company to create a successful product with signifi-
                                                             cantly reduced cost and effort.
                                                                At the time of writing there are many commercial
                                                             companies developing chemistry solutions around Open
                                                             Source cheminformatics components provided by the
                                                             Blue Obelisk community. Examples of such companies
                                                             include iChemLabs, IdeaConsult, Wingu, Silicos, Genet-
                                                             taSoft, eMolecules, hBar, Metamolecular, and Inkspot
 Figure 3 Screenshot of the MolGrabber 3D demo from          Science. Some of these merely use components, but sev-
 ChemDoodle Web Components.                                  eral actively contribute back to the Blue Obelisk project
                                                             they use, or donate new Open Source cheminformatics
                                                             projects to the community.
done by other companies. This model is sometimes                For example, iChemLabs released the ChemDoodle
combined with a free (gratis) model for people working       Web Components library under the GPL v3 license,
at academic institutes, to increase adoption and encou-      based on the upcoming HTML5 Open Standard. It
rage contributions from academics. This solution defines     allows making web and mobile interfaces for chemical
the return on investment as the IP on the software, but      content. The project is already being adopted by others,
has the downside of investment losses due to duplica-        including iBabel [39], ChemSpotlight [40] and the RSC
tion of software and method development, which               ChemSpider [41,42].
become visible when proprietary companies merge.                Silicos has released several Open Source utilities [43]
Some authors have argued that in the chemistry field         based on Open Babel, such as Pharao, a tool for phar-
few contributors are available to volunteer time to          macophore searching, Sieve for filtering molecular struc-
improve codes and IP considerations may prevent con-         ture by molecular property, Stripper for removing core
tributions from industry [35]. If true, this would hamper    scaffold structures from a molecule set, and Piramid for
adoption of Open Source and Open Data in chemistry,          molecular alignment using shape determined by the
and greatly slow the growth of projects such as those in     Gaussian volumes as a descriptor. Additionally, contri-
the Blue Obelisk.                                            butions have been made to the Open Babel project
  The Blue Obelisk community, however, takes advan-          itself.
tage of the fact that much of the investment needed for         Other companies use Blue Obelisk components and
development is either paid for by academic institutes        contribute patches, smaller and larger. For example,
and funding schemes, or by volunteers investing time         IXELIS donated the isomorphism code in the CDK,
and effort. In return, contributors get full access to the   eMolecules donated canonicalisation code to Open
source code, and the Open Source licensing ensures           Babel, Metamolecular improved the extensibility and
that they will have access any time in the future. In this   unit testing suite of OPSIN, and AstraZeneca contribu-
way, the license functions as a social contract between      ted code to the CDK for signatures. This is just a very
everyone to arrange an immediate return on investment.       minor selection, and the reader is encouraged to contact
Effectively, this approach shares the burden of the high     the individual Blue Obelisk projects for a detailed list.
investment in having to develop cheminformatics soft-           In May 2011, a Wellcome Trust Workshop on Mole-
ware from scratch, allowing researchers and commercial       cular Informatics Open Source Software (MIOSS)
partners alike to focus on their core business, rather       explored the role of Open Source in industrial labora-
than the development of prerequisites. In the case of the    tories and companies as well as academia (several of the
Blue Obelisk, the rich collection of Open Source che-        presenters are among the authors of this paper). The
minformatics tools provided greatly reduces investment       meeting identified that Open Source software was extre-
up front for new companies in the cheminformatics            mely valuable to industry not just because it is available
market. Such advantages have also been noted in the          for free, but because it allows the validation of source
drug discovery field [36-38].                                code, data and computational procedures. Some of the
  The use of Open Standards allows everyone to select        discussion was on business models or other ways to
those Blue Obelisk components they find most useful,         maintain development of Open Source software on
as they can easily replace one component with another        which a business relied. Companies are concerned about
providing the same functionality, given that both use        training and support and, in some cases, product liabi-
the same standards for, say, data                            lity. There are difficulties for software for which there is







no formal transaction other than downloading and             parsing of chemical names, followed by step-wise appli-
agreeing to license terms. One anecdote concerned a          cation of nomenclature rules. It is able to offer fast and
company that wished to donate money to an Open               precise conversions for the majority of names using
Source project but could not find a mechanism to do so.      IUPAC organic nomenclature, and is available as a web
  Industry participants also pointed out that there is a     service, Java library and standalone application for maxi-
considerable amount of contribution-in-kind from             mum interoperability.
industry, both from enhancements to software and also
the development of completely new software and toolk-        Chemical database software
its. Companies are now finding it easier to create           Registration, indexing and searching of chemical struc-
mechanisms for releasing Open Source software without        tures in relational databases is one of the core areas of
violating confidentiality or incurring liability. A phrase   cheminformatics. A number of structure registration
from the meeting summed it up: “The ice is beginning         systems have been published in the last five years,
to melt”, signifying that we can expect a rapid increase     exploiting the fact that Open Source cheminformatics
in industry’s interest in Open Source.                       toolkits such as Open Babel and the CDK are available.
                                                             OrChem [48], for example, is an open source extension
Converting chemical names and images to structures           for the Oracle 11G database that adds registration and
The majority of chemical information is not stored in        indexing of chemical structures to support fast substruc-
machine-readable formats, but rather as chemical names       ture and similarity searching. The cheminformatics
or depictions. The OSRA and OPSIN projects focus on          functionality is provided by the CDK. OrChem provides
extracting chemical information from these sources.          similarity searching with response times in the order of
Such software plays a particularly important role for        seconds for databases with millions of compounds,
data mining the chemical literature, including patents       depending on a given similarity cut-off. For substructure
and theses.                                                  searching, it can make use of multiple processor cores
   Optical Structure Recognition Application (OSRA)          on today’s powerful database servers to provide fast
[44] was started in early 2007 with the goal to create       response times in equally large data sets.
the first free and open source tool for extraction and         Besides the traditional and proven relational database
conversion of molecular images into SMILES and SD            approach with added chemical features (‘cartridges’),
files. From the very beginning the underlying philosophy     there is growing interest in tools and approaches based
was to integrate existing open source libraries and to       on the web philosophy and practice. Several groups
avoid “reinventing the wheel” wherever possible. OSRA        [49,50] are experimenting with the Resource Description
relies on a variety of open source components: Open          Framework (RDF) language on the assumption that gen-
Babel for chemical format conversion and molecular           eric high-performance solutions will appear. RDF allows
property calculations, GraphicsMagick for image manip-       everything to be described by URIs (data, molecules,
ulation, Potrace for vectorisation, GOCR and OCRAD           dictionaries, relations). The Chempound system [31], as
for optical character recognition. The growing impor-        deployed in Quixote and elsewhere, is an RDF-based
tance of image recognition technology can be seen in         approach to chemical structures and compounds and
the fact that only a few years ago there was only one        their properties. For small to medium-sized collections
widely available software package for chemical structure     (such as an individual’s calculations or literature retrie-
recognition - CLiDE (commercially developed at Key-          val), there are many RDF tools (e.g. SIMILE, Apache
module, Ltd), but today there are as many as seven           Jena) which can operate in machine memory and pro-
available programs.                                          vide the flexibility that RDF offers. For larger systems, it
   OPSIN (Open Parser for Systematic IUPAC Nomen-            is unclear whether complete RDF solutions (e.g. Vir-
clature) [45] focuses instead on interpreting chemical       tuoso) will be satisfactory or whether a hybrid system
names. The chemical name is the oldest form of com-          based on name-value pairs (e.g. CouchDB, MongoDB)
munication used to describe chemicals, predating even        will be sufficient.
the knowledge of the atomic structure of compounds.
Chemical names are abundant in the scientific literature     Collaboration and interoperability
and encode valuable structural information. Through          One of the successes of the Blue Obelisk has been to
successive books of recommendations [46,47], IUPAC           bring developers together from different Open Source
has tried to codify and to an extent standardise naming      chemistry projects so that they look for opportunities to
practices. OPSIN aims to make this abundance of che-         collaborate rather than compete, and to leverage work
mical names machine readable by translating them to          done by other projects to avoid duplication of effort. As
SMILES, CML or InChI. The program is based around            an example of this, when in March 2008 the Jmol devel-
the use of a regular grammar to guide tokenisation and       opment team were looking to add support for energy
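
Similarity searching with a cut-off, as described for OrChem, is commonly based on the Tanimoto coefficient between binary fingerprints. The Python sketch below shows that idea in its simplest form. It is not OrChem's implementation, and the text does not state which similarity measure OrChem uses; the fingerprints, compound names and the `similarity_search` helper are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query, database, cutoff):
    """Return (name, score) pairs with score >= cutoff, best match first."""
    hits = []
    for name, fingerprint in database.items():
        score = tanimoto(query, fingerprint)
        if score >= cutoff:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions that are set.
database = {
    "compound_1": {1, 4, 7, 9},
    "compound_2": {1, 4, 8},
    "compound_3": {2, 3, 5},
}
query = {1, 4, 7}

hits = similarity_search(query, database, cutoff=0.5)
print(hits)
```

This naive version compares the query with every fingerprint. Database implementations typically use the cut-off to discard candidates early, which is one reason response times can depend on the cut-off chosen.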







minimisation, rather than implement a forcefield from               the danger of vendor lock-in (where users are con-
scratch they ported the UFF forcefield [51] implementa-             strained to using a particular software, a situation which
tion from Open Babel to Jmol. This code enables Jmol                puts them at a disadvantage). This applies as much to
to support 2D to 3D conversion of structures (through               Open Source software as to proprietary software. Cinf-
energy minimisation). In a similar manner, efficient Jmol           ony is a project (first release in May 2008) whose goal is
code for atom-atom rebonding has been ported to the                 to tackle this problem in the area of cheminformatics
CDK. Figure 4 shows the collaborative nature of soft-               toolkits [56]. It is a Python library that enables Open
ware developed in the Blue Obelisk, as one project                  Babel, the CDK, and RDKit (and shortly, Indigo and
builds on functionality provided by another project.                OPSIN) to be used through the same API; this makes it
  Another collaborative initiative between Blue Obelisk             easy, for example, to read a molecule using Open Babel,
projects was the establishment in May 2008 of the Che-              calculate descriptors using the CDK and create a depic-
miSQL project. This brought together the developers of              tion using RDKit.
several open source chemistry database cartridges                     Another way through which interoperability of Blue
(PgChem [52], Mychem [53], OrChem [48] and more                     Obelisk projects has been promoted and developed is
recently Bingo [54]) with a view to making their data-              through integration into workflow software such as
base APIs more similar and collaborating on benchmark               Taverna [57] and KNIME [58] (both open source). Such
datasets for assessing performance. For two of these                software makes it easy to automate recurring tasks, and
projects, PgChem and Mychem, which are both based                   to combine analyses or data from a variety of different
on Open Babel, there is the additional possibility of               software and web services. A combination of the Chem-
working together on a shared codebase.                              istry Development Kit and Taverna, for instance, was
  In the area of cheminformatics toolkits, two of the               reported in 2010 [59]. In the case of KNIME, it comes
existing toolkits, Open Babel and RDKit, are planning to            with a built-in basic collection of CDK-based and Open
work together on a common underlying framework                      Babel-based nodes, while other nodes for the RDKit and
called MolCore [55]. This project is still in the planning          Indigo are available from KNIME’s “Community
stage, but if it is a success it will mean that the two             Updates” site.
libraries will be interoperable (while retaining their
existing focus) but also that the cost of maintaining the           Open Standards
code will be shared among more developers, freeing                  Chemical Markup Language, CML
time for the development of new features.                           Chemical Markup Language (CML) is discussed in sev-
  One of the goals of the Blue Obelisk is to promote                eral articles in this issue, and a brief summary here re-
interoperability in chemical informatics. When barriers             iterates that it is designed primarily to create a validata-
exist to moving chemical data between different soft-               ble semantic representation for chemical objects. The
ware, the community becomes fragmented and there is                 five main areas (molecules, reactions, computational




 Figure 4 Dependency diagram of some Blue Obelisk projects. Each block represents a project. Square blocks show Open Data, ovals are
 Open Source, and diamonds are Open Standards.




J. Cheminf. 2011, 3, 37.
O’Boyle et al. Journal of Cheminformatics 2011, 3:37                                                     Page 10 of 15
http://www.jcheminf.com/content/3/1/37




chemistry, spectra and solid-state (see above)) have now    calls to libraries written in languages such as C, C++
all been extensively deployed and tested. CML can           and Fortran, and compiled into native, machine specific,
therefore be used as a reference for input and output       code. JNI-InChI provides a thin C wrapper, with corre-
for Blue Obelisk software and a means of representing       sponding Java code, around the IUPAC InChI library,
data in Blue Obelisk resources.                             exposing the InChI library’s functionality to the JVM.
   CML, being an XML application, can inter-operate         To overcome the need to have the correct InChI library
with other markup languages and in particular XHTML,        pre-installed on a system, JNI-InChI comes with a vari-
SVG, MathML, docx and more specialised applications         ety of precompiled native binaries and automatically
such as UnitsML and GML (geosciences). We believe           extracts and deploys the correct one for the detected
that it would be possible using these languages to          operating system and architecture. The JNI-InChI
encode large parts of, say, first year chemistry text       library comes with native binaries supporting a range of
books in XML. Similarly, it is possible to create com-      operating systems and architectures; the current version
pound documents with word processing or spreadsheet         has binaries for 32- and 64-bit Windows, Linux and
software that have inter-operating text, graphics and       Solaris, 64-bit FreeBSD and 64-bit Intel-based Mac OS
chemistry (as in Chem4Word). Being a markup lan-            X - a number of which are not supported by the original
guage, CML is designed for re-purposing, including sty-     IUPAC distribution of InChI. The JNI-InChI project has
ling, and therefore a mixture of these languages can be     matured to support the full range of functionality of the
used for chemical catalogues, general publications, log-    InChI C library: structure-to-InChI, InChI-to-structure,
books and many other types of document in the scienti-      AuxInfo-to-structure, InChIKey generation, and InChI
fic process.                                                and InChIKey validation. JNI-InChI provides the InChI
   CML describes much of its semantics through conven-      functionality for a number of Open Source projects,
tions and dictionaries, and the emerging ecosystem          including the Chemistry Development Kit, Bioclipse and
(especially in computational chemistry) is available as a   CMLXOM/JUMBO, and is also used by commercial
semantic resource for many of the applications and spe-     applications and internally in a number of companies.
cifications in this article.                                Through its widespread use and Open Source develop-
                                                            ment model, a number of issues in earlier versions of
InChI                                                       the software have been identified and resolved, and JNI-
The IUPAC InChI identifier is a non-proprietary and         InChI now offers a robust tool for working with InChIs
unique identifier for chemical substances designed to       in the JVM.
enable linking of diverse data compilations. Prior to the
development of the InChI identifier, chemical information   OpenSMILES
systems and databases used a wide variety of (generally     One of the most widely used ways to store chemical
proprietary) identifiers, greatly limiting their            structures is the SMILES format (or SMILES string).
interoperability. Although its development predates the     This is a linear notation developed by Daylight Informa-
Blue Obelisk, software such as Open Babel has included      tion Systems that describes the connection table of a
InChI support since 2005, and support for InChI in          molecule and may optionally encode chirality. Its popu-
Indigo is due in 2011.                                      larity stems from the fact that it is a compact represen-
  Since the official InChI implementation is in C, it is    tation of the chemical structure that is human readable
difficult to access from the other widely used language     and writable, and is convenient to manipulate (e.g. to
for cheminformatics toolkits, Java. Early attempts to       include in spreadsheets, or copy from a web page).
generate InChI identifiers from within Java involved          Despite its widespread use, a formal definition of the
programmatically launching the InChI executable and capturing   language did not exist beyond Daylight’s SMILES Theory
the output, an approach that was found to be                Manual and tutorials. This caused some confusion
fairly unreliable and broke the ‘write once, run any-       in the implementation and interpretation of corner
where’ philosophy of Java. The Blue Obelisk project JNI-    cases, for example the handling of cis/trans bond sym-
InChI [60] was established in 2006 to solve this problem    bols at ring closures. In 2007, Craig James (eMolecules)
by using the Java Native Interface framework to provide     initiated work on the OpenSMILES specification [61], a
transparent access to the InChI library from within Java    complete specification of the SMILES language as an
and other Java Virtual Machine (JVM) based languages,       Open Standard developed through a community pro-
supporting the wider adoption of this standard identifier   cess. The specification is largely complete and contains
by the chemistry community.                                 guidelines on reading SMILES, a formal grammar,
  The Java Native Interface framework provides a            recommendations on standard forms when writing
mechanism for code running inside the JVM, to place         SMILES, as well as proposed extensions.
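
As a rough illustration of why a formal grammar helps implementers, the sketch below splits a SMILES string into tokens using a deliberately simplified pattern. It is not the OpenSMILES grammar: the token classes, the name `tokenize_smiles` and the supported subset (no wildcard atoms, no validation of bracket-atom contents or ring-closure pairing) are assumptions made for this example only.

```python
import re

# Simplified token pattern (an assumption for illustration); the OpenSMILES
# specification defines the complete grammar.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<bracket_atom>\[[^\]]+\])        # atoms written in square brackets
  | (?P<organic_atom>Cl|Br|[BCNOPSFI])  # aliphatic organic-subset atoms
  | (?P<aromatic_atom>[bcnops])         # aromatic organic-subset atoms
  | (?P<bond>[-=\#$:/\\])               # bond symbols, including cis/trans markers
  | (?P<ring_closure>%\d{2}|\d)         # ring-closure numbers
  | (?P<branch>[()])                    # branch open and close
  | (?P<dot>\.)                         # separator for disconnected components
    """,
    re.VERBOSE,
)


def tokenize_smiles(smiles):
    """Return a list of (kind, text) tokens; raise ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_PATTERN.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {smiles[pos]!r} at position {pos}"
            )
        # lastgroup is the name of the alternative that matched
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    # A difluoroethene written with cis/trans bond symbols
    for kind, text in tokenize_smiles("F/C=C/F"):
        print(kind, text)
```

A tokenizer of this kind only recognises symbols; deciding what a cis/trans bond symbol means at a ring closure is exactly the sort of corner case that needs a written specification.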








QSAR-ML                                                                          has their own set of rules. A common reference specifi-
The field of QSAR has long been hampered by the lack                             cation for standardisation would be of immense value in
of open standards, which makes it difficult to share and                         interoperability between structure repositories as well as
reproduce descriptor calculations and analyses. QSAR-                            between toolkits (though the latter is still confounded
ML was recently proposed as an open standard for                                 by differences in lower level cheminformatic features
exchanging QSAR datasets [62]. A dataset in QSAR-ML                              such as aromaticity models).
includes the chemical structures (preferably described in                          We have already discussed the development of an
CML) with InChI to protect integrity, chemical descrip-                          Open SMILES standard. While much progress has been
tors linked to the Blue Obelisk Descriptor Ontology                              made towards a complete specification, more remains to
[63], response values, units, and versioned descriptor                           be done before this can be considered finished. After
implementations to allow descriptors from different                              that point, the next logical step would be to start work
software to be integrated into the same calculation.                             on a standard for the SMARTS language, the extension
Hence, a dataset described in QSAR-ML is completely                              to SMILES that specifies patterns that match chemical
reproducible. To allow for easy setup of QSAR-ML                                 substructures.
compliant datasets, a plugin for Bioclipse was created
with a graphical interface for setting up QSAR datasets                          Open Data
and performing calculations. Descriptor implementa-                              A considerable stumbling block in advocating the
tions are available from the CDK and JOELib, as well as                          release of scientific data as Open Data has been how
via remote web services such as XMPP [64].                                       exactly to define “Open.” A major step forward was the
                                                                                 launch in 2010 of the Panton Principles for Open Data
Remaining challenges                                                             in Science [66]. This formalises the idea that Open Data
A core requirement for chemical structure databases                              maximises the possibility of reuse and repurposing, the
and chemical registration systems in general is the                              fundamental basis of how science works. These princi-
notion of structure standardisation. That is, for a given                        ples recommend that published data be licensed expli-
input structure, multiple representations should be con-                         citly, and preferably under CC0 (Creative Commons ‘No
verted to one canonical form. Structure canonicalisation                         Rights Reserved’, also known as CCZero) [67]. This
routines partially address this aspect, converting multi-                        license allows others to use the data for any purpose
ple alternative topologies to a single canonical form.                           whatsoever without any barriers. Other licenses compa-
However, the problem of standardisation is broader                               tible with the Panton Principles include the Open Data
than just topological canonicalisation. Features that                            Commons Public Domain Dedication and Licence
must be considered include                                                       (PDDL), the Open Data Commons Attribution License,
                                                                                 and the Open Data Commons Open Database License
    •   topological canonicalisation                                             (ODbL) [68].
    •   handling of charges                                                        Despite this positive news, little chemical data compa-
    •   tautomer enumeration and canonicalisation                                tible with these principles has become available from
    •   normalisation of functional groups                                       the traditional chemistry fields of organic, inorganic, and
                                                                                 solid state chemistry. Table 2 lists a few notable excep-
  Currently, most of the individual components of a                              tions, some of which are discussed further below. There
‘standardisation pipeline’ can be implemented using                              is also data available using licenses not compatible with
Blue Obelisk tools. The larger problem is that there is                          the Panton Principles, but where the user is allowed to
no agreed upon list of steps for a standardisation pro-                          modify and redistribute the data. A new data set in this
cess. While some specifications have been published (e.                          category is the data from the ChEMBL database [69],
g., PubChem) and some standardisation services and                               which is available under the Creative Commons Share-
tools are available (for example, PubChem provides an                            Alike Attribution license. The RSC ChemSpider data-
online service to standardise molecules [65]) each group                         base [41], although not fully Open, also hosts Open

Table 2 Open Data in chemistry.
Name                  License/Waiver    Description
Chempedia [98]        CC0               Crowd-sourced chemical names (project discontinued but data still available)
CrystalEye            PDDL              Crystal structures from primary literature
ONS Solubility        CC0               Solubility data for various solvents
Reaction Attempts     CC0               Data on successful and unsuccessful reactions
Overview of major open chemical data available under a license or waiver compatible with the Panton Principles.








Data; for example, spectral data when deposited can be       are available under a CC0 license. Several web services
marked as Open.                                              and feeds are available to filter and re-use the dataset
  Importantly, publishing data as CC0 is becoming            [80]. In particular, models have been developed for the
easier now that websites are becoming available to sim-      prediction of non-aqueous solubility in 72 different sol-
plify publishing data. Two projects that can be men-         vents [81] using the method of Abraham et al [82] with
tioned in this context are FigShare [70], where the data     descriptors calculated by the Chemistry Development
behind unpublished figures can be hosted, and Dryad          Kit. These models are available online and will be
[71] where data behind publications can be hosted.           refined as more solubility data is collected.
Initiatives like this make it possible to host small
amounts of data, and combined these are expected to          The Blue Obelisk Data Repository (BODR)
soon become a substantial knowledge base.                    The Blue Obelisk has created a repository of key chemical
                                                             data in a machine-readable format [83]. The BODR
Reaction Attempts                                            focuses on data that is commonly required for chemistry
Although there are existing databases that allow for         software, and where there is a need to ensure that
searching reactions, those using Open Data are harder        values are standard between codes. Examples are atomic
to find. The Reaction Attempts database [72], to which       masses and conversions between physical constants.
anyone can submit reaction attempts data, consists           These data can be used by others for any purpose (for
mainly of reaction information abstracted from Open          example, for entry into Wikipedia or use in in-house
Notebooks in organic chemistry, such as the UsefulChem       software), and should lead to an enhancement in the
project from the Bradley group [73] and the notebooks        quality of community reference data. The Blue Obelisk
from the Todd group [74]. Key information from               also provides a complementary project, the Chemical
each experiment is abstracted manually, with the only        Structure Repository [83]. It aims to provide 3D coordi-
required information consisting of the ChemSpider IDs        nates, InChIs and several physico-chemical descriptors
of the reactants and the product targeted in the experi-     for a set of 570 organic compounds.
ment; and a link to the laboratory notebook page. Infor-
mation in the database can be searched and accessed          NMRShiftDB
using the web-based Reaction Attempts Explorer [75].         NMRShiftDB [84,85] represents one of the earliest
  Since the database reflects all data from the note-        resources for Open community-contributed data (first
books, it includes experiments in progress, ambiguous        released in 2003). Research groups that measure NMR
results and failed runs. Unlike most reaction databases      spectra or extract them from the literature can contribute
that only identify experiments successfully reported in      that information to NMRShiftDB which provides an
the literature, the Reaction Attempts Explorer allows        Open resource where entries can be searched by chemi-
researchers to easily find patterns in reactions that have   cal structure or properties (especially peaks). Although
already been performed, and since the data are open          it is difficult to encourage large amounts of altruistic
and results are reported across all research groups,         contribution (as happens with Wikipedia), an alternative
intersections are easily discovered and possible Open        possible source of data could come from linking data
Collaboration opportunities are easily found [76,77].        capture with data publication. For example, the Blue
                                                             Obelisk has enough software that it is possible to create
Non-Aqueous Solubility                                       a seamless chain for converting NMR structures in-
Although the aqueous solubility of many common               house into NMRShiftDB entries. If and when the chem-
organic compounds is generally available, quantitative       istry community encourages or requires semantic publi-
reports of non-aqueous solubility are more difficult to      cation of spectra rather than PDFs, it would be possible
find. Such information can be valuable for selecting solvents    to populate NMRShiftDB rapidly along the lines of
for reactions, re-crystallization and related processes.     CrystalEye (see below). A similar approach has been
In 2008, the Open Notebook Science Solubility                demonstrated earlier using the Blue Obelisk components
Challenge was launched for the purpose of measuring          Oscar and Bioclipse using text mining approaches [86].
non-aqueous solubility of organic compounds, reporting
all the details of the experiments in an Open Notebook       CrystalEye
and recording the results as Open Data in a centralized      CrystalEye [87] is an example of cost-effective extraction
database [78,79]. This crowdsourcing project was also        of data from the literature where this is published
supported by Submeta, Sigma-Aldrich, Nature Publish-         both Openly and semantically. Software extracts
ing Group and the Royal Society of Chemistry. The            Openly-published crystal structures from a variety of
database currently holds 1932 total measurements and         scholarly journals, processes them and then makes them
1428 averaged solute/solvent measurements all of which       available through a web interface. It currently contains







about 250,000 structures. CrystalEye serves as a model        with the wider chemistry community outside of the Blue
for a high-value, high-quality Open data resource,            Obelisk remains an open question. If the Blue Obelisk is
including the licensing of each component as Panton-          truly to make an impact, then an attempt must be made
compatible Open data.                                         to reach beyond the subscribers to the Blue Obelisk
                                                              mailing list and blogs of members.
Other areas of activity                                         We hope to see this involvement between the Blue
While each Blue Obelisk project has its own website and       Obelisk and the wider community grow in the future.
point of contact (typically a mailing list), because of the   To this end, we encourage the reader to visit the Blue
breadth of Blue Obelisk projects it can be difficult for a    Obelisk website [94], send a message to our mailing list,
newcomer to understand which of them, if any, can best        investigate related projects or read our blogs.
address a particular problem. To address this issue,
members of the Blue Obelisk established a Question &
                                                              Acknowledgements
Answer website [88] (see Figure 5). This is a website in      NMOB is supported by a Health Research Board Career Development
the style of Stack Overflow [89] that encourages high         Fellowship (PD/2009/13). The OSRA project has been funded in whole or in
quality answers (and questions) through the use of a          part with federal funds from the National Cancer Institute, National Institutes
                                                              of Health, under contract HHSN261200800001E. The content of this
voting system. In the year since it was established, over     publication does not necessarily reflect the views or policies of the
200 users have registered, many of whom had no pre-           Department of Health and Human Services, nor does mention of trade
vious involvement with the Blue Obelisk, showing that         names, commercial products, or organisations imply endorsement by the US
                                                              Government.
the Q&A website complements earlier existing channels
of communication.                                             Author details
                                                              1
  The rise of self-publishing and print-on-demand ser-         Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy
                                                              Building, University College Cork, College Road, Cork, Co. Cork, Ireland. 2NIH
vices has meant that publishing a book is now as              Center for Translational Therapeutics, 9800 Medical Center Drive, Rockville,
straightforward as uploading to an appropriate website.       MD 20878, USA. 3Division of Molecular Toxicology, Institute of Environmental
Unlike the traditional publishing route where books           Medicine, Nobels väg 13, Karolinska Institutet, 171 77 Stockholm, Sweden.
                                                              4
                                                               Unilever Centre for Molecular Sciences Informatics, Department of
with projected low sales volume would be expensive,           Chemistry, University of Cambridge, Lensfield Road, CB2 1EW, UK.
websites such as Lulu [90] allow the sale of low-priced       5
                                                               Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751
books on chemistry software, and books are now avail-         24 Uppsala, Sweden. 6Department of Chemistry, Drexel University, 32nd and
                                                              Chestnut streets, Philadelphia, PA 19104, USA. 7Chemical Biology Laboratory,
able for purchase on Jmol [91], the Chemistry Develop-        Basic Research Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD
ment Kit [92] and Open Babel [93].                            21702, USA. 8St. Olaf College, 1520 St. Olaf Ave., Northfield, MN 55057, USA.
                                                              9
                                                               Kitware, Inc., 28 Corporate Drive, Clifton Park, NY 12065, USA. 10Department
                                                              of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh, PA
Conclusions                                                   15260, USA. 11eMolecules Inc., 380 Stevens Ave., Solana Beach, California
We have shown that the Blue Obelisk has been very             92075, USA. 12Ideaconsult Ltd., 4.A.Kanchev str., Sofia 1000, Bulgaria.
                                                              13
successful in bringing together researchers and develo-         Department of Engineering, Computer Science, Physics, and Mathematics,
                                                              Oral Roberts University, 7777 S. Lewis Ave. Tulsa, OK 74171, USA. 14Leiden
pers with common interests in ODOSOS, leading to              Institute of Chemistry, Leiden University, Einsteinweg 55, 2333 CC Leiden,
development of many useful resources freely available to      The Netherlands. 15Department of Chemistry, State University of New York at
the chemistry community. However, how best to engage          Buffalo, Buffalo, NY 14260-3000, USA. 16Université de Strasbourg, IPHC, CNRS,
                                                              UMR7178, 23 rue du Loess 67037, Strasbourg, France. 17GGA Software
                                                              Services LLC, 41 Nab. Chernoi rechki 194342, Saint Petersburg, Russia.
                                                              18
                                                                Cheminformatics and Metabolism Team, European Bioinformatics Institute
                                                              (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
                                                              19
                                                                Department of Chemistry, University of Washington, Seattle, WA 98195,
                                                              USA. 20iChemLabs, 200 Centennial Ave., Suite 200, Piscataway, NJ 08854,
                                                              USA.

                                                              Authors’ contributions
                                                              The overall layout of the manuscript grew from discussions between NMOB,
                                                              RG and ELW. The authorship of the paper is drawn from those people
                                                              connected with fully Open Data/Standards/Source (OSI-compliant or OKF-
                                                              compliant) projects associated with the Blue Obelisk. There are a large
                                                              number of people contributing to these projects and because those projects
                                                              are published in their own right it is not appropriate to include all their
                                                              developers by default. We invited a number of ‘project gurus’ who have
                                                              been active in promoting the Blue Obelisk, to be authors on this paper and
                                                              most have accepted and contributed.

                                                              Competing interests
                                                              The authors declare that they have no competing interests.
 Figure 5 Screenshot of the Blue Obelisk eXchange Question    Received: 1 July 2011 Accepted: 14 October 2011
 and Answer website.                                          Published: 14 October 2011








References                                                                          34. ChemDoodle Web Components: HTML5 Chemistry. [http://web.
1. Matos PD, Alcantara R, Dekker A, Ennis M, Hastings J, Haug K, Spiteri I,             chemdoodle.com].
    Turner S, Steinbeck C: Chemical Entities of Biological Interest: an update.     35. Stahl MT: Open-source software: not quite endsville. Drug Discov Today
    Nucleic Acids Res 2009, 38:D249-D254.                                               2005, 10:219-22.
2. Murray-Rust P: The Blue Obelisk. CDK News 2005, 2:43-46.                         36. DeLano WL: The case for open-source software in drug discovery. Drug
3. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C,                Discov Today 2005, 10:213-7.
    Wegner J, Willighagen EL: The Blue Obelisk - Interoperability in Chemical       37. Munos B: Can open-source R&D reinvigorate drug research? Nat Rev Drug
    Informatics. J Chem Inf Model 2006, 46:991-998.                                     Discov 2006, 5:723-9.
4. Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide                 38. Geldenhuys WJ, Gaasch KE, Watson M, Allen DD, Van der Schyf CJ:
    Web. 1. Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.                   Optimizing the use of open-source software applications in drug
5. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The              discovery. Drug Discov Today 2006, 11:127-32.
    Chemistry Development Kit (CDK): An Open-Source Java Library for                39. iBabel. [http://homepage.mac.com/swain/Sites/Macinchem/page65/ibabel3.
    Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43:493-500.                  html].
6. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: Recent           40. ChemSpotlight. [http://chemspotlight.openmolecules.net].
    developments of the chemistry development kit (CDK) - an open-source            41. ChemSpider - the free chemical database. [http://www.chemspider.com].
    java library for chemo- and bioinformatics. Curr Pharm Design 2006,             42. iChemLabs and RSC ChemSpider announce partnership. [http://www.
    12:2111-2120.                                                                       chemspider.com/blog/ichemlabs-and-rsc-chemspider-announce-partnership.
7. Open Babel. [http://openbabel.org].                                                  html].
8. JOELib. [http://sf.net/projects/joelib].                                         43. Silicos Open Source Software. [http://silicos.silicos-it.com/download.html].
9. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM: Small                44. OSRA. [http://cactus.nci.nih.gov/osra/].
    Molecule Subgraph Detector (SMSD) Toolkit. J Cheminf 2009, 1:12.                45. Lowe DM, Corbett PT, Murray-Rust P, Glen RC: Chemical Name to
10. Dong X, Gilbert K, Guha R, Heiland R, Kim J, Pierce M, Fox G, Wild D: A             Structure: OPSIN, an Open Source Solution. J Chem Inf Model 2011,
    Web Service Infrastructure for Chemoinformatics. J Chem Inf Model 2007,             51:739-753.
    47:1303-1307.                                                                   46. IUPAC: Nomenclature of Organic Chemistry Pergamon Press, Oxford; 1979.
11. RDKit. [http://rdkit.org].                                                      47. IUPAC: A Guide to IUPAC Nomenclature of Organic Compounds
12. Indigo. [http://ggasoftware.com/opensource/indigo].                                 (Recommendations 1993) Blackwell Scientific publications, Oxford; 1993.
13. Open Babel - Related Software. [http://openbabel.org/wiki/                      48. Rijnbeek M, Steinbeck C: OrChem - An open source chemistry search
    Related_Projects].                                                                  engine for Oracle(R). J Cheminf 2009, 1:17.
14. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-        49. Willighagen EL, Brändle MP: Resource description framework technologies
    Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench               in chemistry. J Cheminf 2011, 3:15.
    for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.                   50. Chen B, Dong X, Jiao D, Wang H, Zhu Q, Ding Y, Wild DJ: Chem2Bio2RDF:
15. Avogadro: an open-source molecular builder and visualization tool.                  a semantic framework for linking and data mining chemogenomic and
    [http://avogadro.openmolecules.net].                                                systems chemical biology data. BMC Bioinformatics 2010, 11:255.
16. Spjuth O, Alvarsson J, Berg A, Eklund M, Kuhn S, Masak C, Torrance G,           51. Rappé AK, Casewit CJ, Colwell KS, Goddard WA III, Skiff WM: UFF, a full
    Wagener J, Willighagen E, Steinbeck C, Wikberg J: Bioclipse 2: A                    periodic table force field for molecular mechanics and molecular
    scriptable integration platform for the life sciences. BMC Bioinformatics           dynamics simulations. J Am Chem Soc 1992, 114:10024-10035.
    2009, 10:397.                                                                   52. PgChem. [http://pgfoundry.org/projects/pgchem/].
17. Alvarsson J, Andersson C, Spjuth O, Larsson R, Wikberg J: Brunn: An open        53. Mychem. [http://mychem.sf.net].
    source laboratory information system for microplates with a graphical           54. Bingo. [http://ggasoftware.com/opensource/bingo].
    plate layout design process. BMC Bioinformatics 2011, 12:179.                   55. MolCore. [http://molcore.sf.net].
18. Kalzium - Periodic Table and Chemistry in KDE. [http://edu.kde.org/             56. O’Boyle NM, Hutchison GR: Cinfony-combining Open Source
    applications/science/kalzium/].                                                     cheminformatics toolkits behind a common interface. Chem Cent J 2008,
19. XtalOpt - Evolutionary Crystal Structure Prediction. [http://xtalopt.               2:24.
    openmolecules.net].                                                             57. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T:
20. Lonie DC, Zurek E: XtalOpt: An open-source evolutionary algorithm for               Taverna: a tool for building and running workflows of services. Nucleic
    crystal structure prediction. Comput Phys Commun 2011, 182:372-387.                 Acids Res 2006, 34:W729-W732.
21. Jeliazkova N, Jeliazkov V: AMBIT RESTful web services: an implementation        58. KNIME. [http://www.knime.org].
    of the OpenTox application programming interface. J Cheminf 2011, 3:18.         59. Kuhn T, Willighagen E, Zielesny A, Steinbeck C: CDK-Taverna: an open
22. Jeliazkova N, Jaworska J, Worth A: Open Source Tools for Read-Across and            workflow environment for cheminformatics. BMC Bioinformatics 2010,
    Category Formation. In In Silico Toxicology : Principles and Applications.          11:159.
    Edited by: Cronin M, Madden J. Cambridge UK: RSC Publishing;                    60. JNI-InChI. [http://jni-inchi.sf.net/index.html].
    2010:408-445.                                                                   61. The OpenSMILES specification. [http://opensmiles.org].
23. ToxTree. [http://toxtree.sf.net].                                               62. Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE: Towards
24. Patlewicz G, Jeliazkova N, Safford RJ, Worth AP, Aleksiev B: An evaluation of       interoperable and reproducible QSAR analyses: Exchange of datasets. J
    the implementation of the Cramer classification scheme in the Toxtree               Cheminf 2010, 2:5.
    software. SAR QSAR Environ Res 2008, 19:495-524.                                63. The Blue Obelisk Descriptor Ontology. [http://qsar.sourceforge.net/dicts/
25. O’Boyle NM, Tenderholt AL, Langner KM: cclib: A library for package-                qsar-descriptors/index.xhtml].
    independent computational chemistry algorithms. J Comp Chem 2008,               64. Wagener J, Spjuth O, Willighagen EL, Wikberg JES: XMPP for cloud
    29:839-845.                                                                         computing in bioinformatics supporting discovery and invocation of
26. GaussSum. [http://gausssum.sf.net].                                                 asynchronous web services. BMC Bioinformatics 2009, 10:279.
27. QMForge. [http://qmforge.sf.net].                                               65. PubChem Standardization Service. [http://pubchem.ncbi.nlm.nih.gov//
28. Dapprich S, Frenking G: Investigation of Donor-Acceptor Interactions: A             standardize/standardize.cgi].
    Charge Decomposition Analysis Using Fragment Molecular Orbitals. J              66. Panton Principles - Principles for Open Data in Science. [http://
    Phys Chem 1995, 99:9352-9362.                                                       pantonprinciples.org].
29. JUMBO-Converters. [https://bitbucket.org/wwmm/jumbo-converters].                67. About CC0 - “No Rights Reserved”. [http://creativecommons.org/about/
30. Lensfield 2. [https://bitbucket.org/sea36/lensfield2].                              cc0].
31. Chempound. [https://bitbucket.org/chempound/chempound].                         68. Open Licenses - Data. [http://www.opendefinition.org/licenses/#Data].
32. Stein W, et al: Sage Mathematics Software The Sage Development Team;            69. Overington J: ChEMBL. An interview with John Overington, team leader,
    2011 [http://www.sagemath.org].                                                     chemogenomics at the European Bioinformatics Institute Outstation of
33. Hanson RM: Jmol - a paradigm shift in crystallographic visualization. J             the European Molecular Biology Laboratory (EMBL-EBI). Interview by
    Appl Cryst 2010, 43:1250-1260.                                                      Wendy A. Warr. J Comp Aided Mol Des 2009, 23:195-198.




J. Cheminf. 2011, 3, 37.
O’Boyle et al. Journal of Cheminformatics 2011, 3:37                                                                                          Page 15 of 15
http://www.jcheminf.com/content/3/1/37




70. FigShare. [http://figshare.com].
71. Dryad. [http://datadryad.org].
72. Reaction Attempts Database. [http://onswebservices.wikispaces.com/
    reactions].
73. Bradley JC: Useful Chemistry: Reaction Attempts Book Edition 1 and
    UsefulChem Archive.[http://usefulchem.blogspot.com/2010/04/reaction-
    attempts-book-edition-1-and.html].
74. Bradley JC: Useful Chemistry: The Synaptic Leap Experiments on
    Reaction Attempts.[http://usefulchem.blogspot.com/2010/05/synaptic-leap-
    experiments-on-reaction.html].
75. Bradley JC: Useful Chemistry: Reaction Attempts Explorer.[http://
    usefulchem.blogspot.com/2010/06/reaction-attempts-explorer.html].
76. Bradley JC: Useful Chemistry: Visualizing Social Networks in Open
    Notebooks.[http://usefulchem.blogspot.com/2010/12/visualizing-social-
    networks-in-open.html].
77. Bradley JC, Lang AS, Koch S, Neylon C: Collaboration using Open
    Notebook Science in Academia. In Collaborative computational
    technologies for biomedical research. Edited by: Ekins S, Hupcey MA, Williams
    AJ. Hoboken N.J.: John Wiley 2011:425-452.
78. Bradley JC, Neylon C, Guha R, Williams AJ, Hooker B, Lang ASID, Friesen B,
    Bohinski T, Bulger D, Federici M, Hale J, Mancinelli J, Mirza KB, Moritz MJ,
    Rein D, Tchakounte C, Truong HT: Open Notebook Science Challenge:
    Solubilities of Organic Compounds in Organic Solvents. Nature Precedings
    2010 [http://dx.doi.org/10.1038/npre.2010.4243.3].
79. Bradley J, Guha R, Lang A, Lindenbaum P, Neylon C, Williams A,
    Willighagen E: Beautifying Data in the Real World. In Beautiful Data.. 1
    edition. Edited by: Segaran T, Hammerbacher J. Sebastopol CA: O’Reilly;
    2009:259-278.
80. Open Notebook Solubility Web Services. [http://onswebservices.
    wikispaces.com/solubility].
81. Bradley J: Useful Chemistry: General Transparent Solubility Prediction
    using Abraham Descriptors.[http://usefulchem.blogspot.com/2010/07/
    general-transparent-solubility.html].
82. Abraham MH, Smith RE, Luchtefeld R, Boorem AJ, Luo R, Acree WE Jr:
    Prediction of solubility of drugs and other compounds in organic
    solvents. J Pharm Sci 2010, 99:1500-1515.
83. Blue Obelisk Data Repository. [http://bodr.sf.net].
84. NMRShiftDB. [http://www.nmrshiftdb.org].
85. Steinbeck C, Kuhn S: NMRShiftDB - compound identification and
    structure elucidation support through a free community-build web
    database. Phytochemistry 2004, 65:2711-2717.
86. Willighagen EL: Chemical Archeology: OSCAR3 to NMRShiftDB.org.[http://
    chem-bla-ics.blogspot.com/2006/09/chemical-archeology-oscar3-to.html].
87. CrystalEye. [http://wwmm.ch.cam.ac.uk/crystaleye/].
88. Blue Obelisk QA. [http://blueobelisk.shapado.com].
89. Stack Overflow. [http://stackoverflow.com].
90. Lulu. [http://lulu.com].
91. Herráez A: In How to use Jmol to study and present molecular structures.
    Volume 1. Lulu Enterprises, Morrisville, NC, US; 2007.
92. Willighagen E: Groovy Cheminformatics with the Chemistry Development Kit
    Lulu Enterprises, Morrisville, NC, US; 2011.
93. Hutchison GR, Morley C, O’Boyle NM, James C, Swain C, De Winter H,
    Vandermeersch T: Open Babel - Official User Guide Lulu Enterprises,
    Morrisville, NC, US; 2011.
94. Blue Obelisk web site. [http://blueobelisk.org].
95. Day N, Murray-Rust P, Tyrrell S: CIFXML: a schema and toolkit for
    managing CIFs in XML. J Appl Cryst 2011, 44:628-634.
96. Hawizy L, Jessop D, Adams N, Murray-Rust P: ChemicalTagger: A tool for
    Semantic Text-mining in Chemistry. J Cheminf 2011, 3:17.
                                                                                    Publish with ChemistryCentral and every
97. O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR: Confab         scientist can read your work free of charge
    - Systematic generation of diverse low-energy conformers. J Cheminf
                                                                                              Open access provides opportunities to our
    2011, 3:8.
98. Chempedia. [http://chempedia.com].                                                    colleagues in other parts of the globe, by allowing
                                                                                              anyone to view the content free of charge.
doi:10.1186/1758-2946-3-37
Cite this article as: O’Boyle et al.: Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of Cheminformatics 2011 3:37.

Chemistry Central Journal
Software                                                      Open Access

Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit

Noel M O'Boyle* (1,2), Chris Morley (3) and Geoffrey R Hutchison (4)

Address: (1) Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK, (2) Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK, (3) OpenBabel Development Team and (4) Department of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA

Email: Noel M O'Boyle* - baoilleach@gmail.com; Chris Morley - c.morley@gaseq.co.uk; Geoffrey R Hutchison - geoffh@pitt.edu
* Corresponding author

Published: 9 March 2008
Received: 23 January 2008
Accepted: 9 March 2008

Chemistry Central Journal 2008, 2:5 doi:10.1186/1752-153X-2-5
This article is available from: http://journal.chemistrycentral.com/content/2/1/5

© 2008 O'Boyle et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Scripting languages such as Python are ideally suited to common programming tasks in cheminformatics such as data analysis and parsing information from files. However, for reasons of efficiency, cheminformatics toolkits such as the OpenBabel toolkit are often implemented in compiled languages such as C++. We describe Pybel, a Python module that provides access to the OpenBabel toolkit.

Results: Pybel wraps the direct toolkit bindings to simplify common tasks such as reading and writing molecular files and calculating fingerprints. Extensive use is made of Python iterators to simplify loops such as that over all the molecules in a file. A Pybel Molecule can be easily interconverted to an OpenBabel OBMol to access those methods or attributes not wrapped by Pybel.

Conclusion: Pybel allows cheminformaticians to rapidly develop Python scripts that manipulate chemical information. It is open source, available cross-platform, and offers the power of the OpenBabel toolkit to Python programmers.

Background

Cheminformaticians often need to write once-off scripts to extract data from text files, prepare data for analysis or carry out simple statistics. Scripting languages such as Perl, Python and Ruby are ideally suited to these day-to-day tasks [1]. Such languages are, however, an order of magnitude or more slower than compiled languages such as C++. Since cheminformaticians regularly deal with molecular files containing thousands of molecules and many cheminformatics algorithms are computationally expensive, cheminformatics toolkits are typically written in compiled languages for performance.

OpenBabel is a C++ toolkit with extensive capabilities for reading and writing molecular file formats (over 80 are supported) as well as for manipulating molecular data [2]. Many standard chemistry algorithms are included, for example, determination of the smallest set of smallest rings, bond order perception, addition of hydrogens, and assignment of Gasteiger charges. In relation to cheminformatics, OpenBabel supports SMARTS searching [3], molecular fingerprints [4] (both Daylight-type, and structural-key based), and includes group contribution descriptors for LogP [5], polar surface area (PSA) [6] and molar refractivity (MR) [5].
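The abstract's point about iterators simplifying loops over all the molecules in a file can be illustrated without OpenBabel itself. The following is a minimal, standard-library-only sketch and not Pybel's implementation: `read_sdf_records` and the toy input are hypothetical names invented for this illustration, and the only file-format detail assumed is that each record in an SD file is terminated by a `$$$$` line.

```python
from typing import Iterable, Iterator, List

# Assumption for this sketch: each molecule record in an SD file
# ends with a line containing only "$$$$".
SDF_RECORD_END = "$$$$"


def read_sdf_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one record (a list of text lines) per molecule.

    Hypothetical stand-in for a molecule reader: because the function
    uses 'yield', calling it returns a generator that produces records
    lazily, one at a time, as the caller loops over it.
    """
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line == SDF_RECORD_END:
            yield record
            record = []
        else:
            record.append(line)
    if record:  # tolerate a final record with no terminator
        yield record


# Toy two-record input; real SD records also carry atom and bond blocks.
demo_text = (
    "ethanol\n  toy record body\n$$$$\n"
    "benzene\n  toy record body\n$$$$\n"
)

# The caller needs no explicit end-of-file bookkeeping.
titles = [record[0] for record in read_sdf_records(demo_text.splitlines())]
```

Here `titles` collects the first line of each record. The loop that consumes the generator never has to track whether the end of the input has been reached, which is the convenience the abstract refers to.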
Of the current popular scripting languages, Python [7] is the de-facto standard language for scripting in cheminformatics. Several commercial cheminformatics toolkits have interfaces in Python: OpenEye's closed-source successor to OpenBabel, OEChem [8], is a C++ toolkit with interfaces in Python and Java; Rational Discovery's RDKit [9], which is now open source, is a C++ cheminformatics toolkit with a Python interface; the Daylight toolkit [10] from Daylight Chemical Information Systems, written in C, only has Java and C++ wrappers but PyDaylight [11], available separately from Dalke Scientific, provides a Python interface to the toolkit; the Cambios Molecular Toolkit [12] from Cambios Consulting is a commercial C++ toolkit with a Python interface. There are also toolkits entirely implemented in Python: Frowns [13], an open source cheminformatics toolkit by Brian Kelley, and PyBabel [14], an open source toolkit included in the MGLTools package from the Molecular Graphics Labs at the Scripps Research Institute. Note that the latter is not related to the OpenBabel project; rather its name derives from the fact that its aim was to implement in Python some of the functionality of Babel v1.6 [15], a command-line application for converting file formats which is a predecessor of OpenBabel.

Here we describe the implementation and application of Pybel, a Python module that provides access to the OpenBabel C++ library from the Python programming language. Pybel builds on the basic Python bindings to make it easier to carry out frequent tasks in cheminformatics. It also aims to be as 'Pythonic' as possible; that is, to adhere to Python language conventions and idioms, and where possible to make use of Python language features such as iterators. The result is a module that takes advantage of Python's expressive syntax to allow cheminformaticians to carry out tasks such as SMARTS matching, data field manipulation and calculation of molecular fingerprints in just a few lines of code.

Implementation

SWIG bindings

Python bindings to the OpenBabel toolkit were created using SWIG [16]. SWIG (Simplified Wrapper and Interface Generator) is a tool that automates the generation of bindings to libraries written in C or C++. One of the advantages of SWIG compared to other automated wrapping methods such as Boost.Python [17] or SIP [18] is that SWIG also supports the generation of bindings to several other languages. For example, OpenBabel also uses SWIG to generate bindings for Perl, Ruby and Java. An additional advantage is that SWIG will directly parse C or C++ header files while Boost.Python and SIP require each C++ class to be exposed manually. The input to SWIG is an interface file containing a list of OpenBabel header files for which to generate bindings. Using the signatures in the header files, SWIG generates a C file which, when compiled and linked with the Python development libraries and OpenBabel, creates a Python extension module, openbabel. This can then be imported into a Python script like any other Python module using the "import openbabel" statement.

For a small number of C++ objects and functions, it was necessary to add some convenience functions to facilitate access from Python. Certain types of molecule files have additional data present in addition to the connection table. OpenBabel stores these data in subclasses of OBGenericData such as OBPairData (for the data fields in molecule files such as MOL files and SDF files) and OBUnitCell (for the data fields in CIF files). To access the data it is necessary to 'downcast' an instance of OBGenericData to the specific subclass. For this reason, two convenience functions were added to the interface file, one to cast OBGenericData to OBPairData, and one to cast to OBUnitCell. Another convenience function was added to convert a Python list to a C array of doubles, as this type of input is required for a small number of OpenBabel functions.

Iterators are an important feature of the OpenBabel C++ library. For example, OBAtomAtomIter allows the user to easily iterate over the atoms attached to a particular atom, and OBResidueIter is an iterator over the residues in a molecule. The OpenBabel iterators use the dereference operator to access the data, the increment operator to iterate to the next element, and the boolean operator to test whether any elements remain. Iterators are also a core feature of the Python language. However, the iterators used by OpenBabel are not automatically converted into Python iterators. To deal with this, Python iterator classes that wrap the dereference, increment and boolean operators behind the scenes were added to the SWIG interface file, so that Python statements such as "for attached_obatom in OBAtomAtomIter(obatom)" work without problem.

Pybel module

The SWIG bindings provide direct access from Python to the C++ objects and functions in the OpenBabel API (application programming interface). The purpose of the Pybel module is to wrap these bindings to present a more Pythonic interface to OpenBabel (Figure 1). This extra level of abstraction is useful as Python programmers expect Python libraries to behave in certain ways that a C++ library does not. For example, in Python, attributes of an object are often directly accessed whereas in C++ it is typical to call Get/Set functions to access them. A C++ function returning a particular object might require a pointer to an empty object as a parameter, whereas the Python equivalent would not. Even something as simple
as differences in the conventions for the case of letters used in variable and method names is a problem, as it makes it more likely for Python programmers to introduce bugs in their code.

Figure 1: The relationship between Python modules described in the text and the OpenBabel C++ library. Python modules are shown in green; the C++ library is shown in blue.

One of the key aims of Pybel was to reduce the amount of code necessary to carry out common tasks. This is especially important for a scripting language where programming is often done interactively at a command prompt. In addition, as for any programming language, repeated entry of code for routine and common tasks (so-called 'boilerplate code') is a common cause of errors in code. Reading and writing molecule files is one of the most common tasks for users of OpenBabel but requires several lines of code if using the SWIG bindings. The following code shows how to store each molecule in a multimolecule SDF file in a list called allmols:

    import openbabel

    allmols = []
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("sdf")
    obmol = openbabel.OBMol()
    notatend = obconversion.ReadFile(obmol, "inputfile.sdf")
    while notatend:
        allmols.append(obmol)
        obmol = openbabel.OBMol()
        notatend = obconversion.Read(obmol)

To replace this somewhat verbose code, Pybel provides a readfile method that takes a file format and filename and returns molecules using the 'yield' keyword. This changes the method into a 'generator', a Python language feature where a method behaves like an iterator. Iterators are a major feature of the Python language which are used for looping over collections of objects. In Pybel, we have used iterators where possible to simplify access to the toolkit. As a result, the equivalent to the preceding code is:

    import pybel

    allmols = [mol for mol in pybel.readfile("sdf", "inputfile.sdf")]

The benefits of iterator syntax are clear when dealing with multimolecule files. For single molecule files, however, the user needs to remember to explicitly request the iterator to return the first and only molecule using the next method:

    mol = pybel.readfile("mol", "inputfile.mol").next()

Pybel provides replacements for two of the main classes in the OpenBabel library, OBMol and OBAtom. The following discussion describes the Pybel Molecule class which wraps an instance of OBMol, but the same design principles apply to the Pybel Atom class. Table 1 summarises the attributes and methods of the Molecule object. By wrapping the base class, Pybel can enhance the Molecule
object by providing (1) direct access to attributes rather than through the use of Get methods, (2) additional attributes of the object, and (3) additional methods that act on the object.

Table 1: Attributes and methods supported by the Pybel Molecule object

    Attribute    Description*
    OBMol        The underlying OBMol object
    atoms        A list of Pybel Atoms
    charge       The total charge (GetTotalCharge)
    data         A MoleculeData object for access to data fields
    dim          The dimensionality of the coordinates (GetDimension)
    energy       The heat of formation (GetEnergy)
    exactmass    The mass calculated using isotopic abundance (GetExactMass)
    flags        The set of flags used internally by OpenBabel (GetFlags)
    formula      The stoichiometric formula (GetFormula)
    mod          The number of nested BeginModify() calls (internal use) (GetMod)
    molwt        The standard molar mass (GetMolWt)
    spin         The total spin multiplicity (GetTotalSpinMultiplicity)
    sssr         The smallest set of smallest rings (GetSSSR)
    title        The title of the molecule (often the filename) (GetTitle)
    unitcell     Unit cell data (if present)

    Method       Description
    write        Write the molecule to a file or return it as a string
    calcfp       Return a molecular fingerprint as a Fingerprint object
    calcdesc     Return the values of the group contribution descriptors
    __iter__     Enable iteration over the Atoms in the Molecule

    *Where a Molecule attribute is a direct replacement for a 'Get' method of the underlying OBMol, the name of the method is given in parentheses.

(1) As mentioned earlier, it is typical in Python to access attribute values directly rather than using Get/Set methods. With this in mind, the Molecule class adds attributes such as energy, formula and molwt (among others) which give the values returned by calling GetEnergy(), GetFormula() and GetMolWt(), respectively, on the underlying OBMol (see Table 1 for the full list).

(2) One of the aims of Pybel is to simplify access to some of the most common attributes. With this in mind, an atoms attribute has been added which returns a list of the atoms of the molecule as Pybel Atoms. Access to the data fields associated with a molecule has been simplified by creation of a MoleculeData object which is returned when the data attribute of a Molecule is accessed. MoleculeData presents a dictionary interface to the data fields of the molecule. Accessing and updating these fields is more convoluted if using the SWIG bindings. Compare the following statements for accessing the "comment" field of the variable mol, an OBMol:

    # Using the SWIG bindings
    value = openbabel.toPairData(mol.GetData("comment")).GetValue()

    # Using Pybel
    value = pybel.Molecule(mol).data["comment"]

It should be noted that all of these attributes are calculated on-the-fly rather than stored for future access as the underlying OBMol may have been modified.

(3) Four additional methods have been added to the Pybel Molecule (Table 1). The first is a write method which writes a representation of the Molecule to a file and takes care of error handling. As with reading molecules from files (see above), this method simplifies the procedure significantly compared to using the SWIG bindings directly. In addition, a calcfp method and a calcdesc method have been added which calculate a binary fingerprint for the molecule, and some descriptor values, respectively. In the OpenBabel library these are not methods of the OBMol, but rather are loaded as plugins (by OBFingerprint.FindFingerprint and OBDescriptor.FindType, respectively) to which an OBMol is passed as input. The __iter__ method is a special Python method that enables iteration over an object; in the case of a Molecule, the defined iterator loops over the Atoms of the Molecule. This feature enables constructions such as "for atom in mol" where mol is a Pybel Molecule.

SMARTS is a query language developed by Daylight Chemical Information Systems for molecular substructure
searching [3]. As implemented in the OpenBabel toolkit, finding matches of a particular substructure in a particular molecule is a four step process that involves creating an instance of OBSmartsPattern, initialising it with a SMARTS pattern, searching for a match, and finally retrieving the result:

    obsmarts = openbabel.OBSmartsPattern()
    obsmarts.Init("[#6][#6]")
    obsmarts.Match(obmol)
    results = obsmarts.GetUMapList()

Since a SMARTS query can be thought of as a regular expression for molecules, in Pybel we decided to wrap the SMARTS functionality in an analogous way to Python's regular expression module, re. With these changes, the same process takes only two steps, an initialisation step and a search step:

    smarts = pybel.Smarts("[#6][#6]")
    results = smarts.findall(pybelmol)

Pybel was not written to replace the SWIG bindings but rather to make it simpler to perform common tasks. As a result, Pybel does not attempt to wrap every single method and class in the OpenBabel library. Because of this, a user may often want to interconvert between an OBMol and a Molecule, or an OBAtom and an Atom. This is quite a straightforward process. A Pybel Molecule can be created by passing an OBMol to the Molecule constructor. In the following example an OBMol is created using the SWIG bindings and then written to a file using Pybel:

    import openbabel
    import pybel

    obmol = openbabel.OBMol()
    a = obmol.NewAtom()
    a.SetAtomicNum(6)
    a.SetVector(0.0, 1.0, 2.0)  # Set coordinates
    b = obmol.NewAtom()
    obmol.AddBond(1, 2, 1)  # Single bond from Atom 1 to Atom 2
    pybel.Molecule(obmol).write("mol", "outputfile.mol")

The OBMol wrapped by a Pybel Molecule can be accessed through the OBMol attribute. This makes it easy to call a method not wrapped by Pybel, such as OBMol.NumRotors, which returns the number of rotatable bonds in a molecule:

    mol = pybel.readfile("mol", "inputfile.mol").next()
    numrotors = mol.OBMol.NumRotors()

Documentation and Testing
To minimise programming errors, programs written in dynamically-typed languages such as Python should be tested comprehensively. Pybel has 100% code coverage in terms of unit tests, as measured by Ned Batchelder's coverage.py [19]. It also has several doctests, short snippets of Python code included in documentation strings which serve as both examples of usage and as unit tests.

The Pybel API is fully documented with docstrings. These can be accessed in the usual way with the help() command at the interactive Python prompt after importing Pybel: for example, "help(pybel.Molecule)". In addition, the OpenBabel Python web page [20] contains a complete description of how to use the SWIG bindings and the Pybel API. The webpage also contains links to HTML versions of the OpenBabel API documentation and Pybel API documentation. The latter is included in Additional File 1.

Results and Discussion
The principal aim of Pybel is to make it simpler to use the OpenBabel toolkit to carry out common tasks in cheminformatics. These common tasks include reading and writing molecule files, accessing data fields of a molecule, computing and comparing molecular fingerprints and SMARTS matching. Here we present some examples that illustrate how Pybel may be used to carry out common cheminformatics tasks.

Removal of duplicate molecules
When merging different datasets or as a final step in preprocessing, it may be necessary to identify and remove duplicate molecules. In the following example, only the unique molecules in the multimolecule SDF file "inputfile.sdf" will be written to "uniquemols.sdf". Here we will assume that a unique InChI string (IUPAC International Chemical Identifier) indicates a unique molecule. A similar procedure could be performed using the OpenBabel canonical SMILES format, by replacing "inchi" with "can" in the following:

    import pybel
    inchis = []
    output = pybel.Outputfile("sdf", "uniquemols.sdf")
    for mol in pybel.readfile("sdf", "inputfile.sdf"):
        inchi = mol.write("inchi")
        if inchi not in inchis:
            output.write(mol)
            inchis.append(inchi)
    output.close()

Selection of similar molecules
Another common task in cheminformatics is the selection of a set of molecules of similar structure to a target molecule. Here we will assume that structural similarity is indicated by a Tanimoto coefficient [21] of at least 0.7 with respect to Daylight-type (that is, based on hashed paths through the molecular graph) fingerprints. Note that Pybel redefines the | operator (bitwise OR) for Fingerprint objects as the Tanimoto coefficient:

    import pybel

    targetmol = pybel.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = pybel.Outputfile("sdf", "similarmols.sdf")
    for mol in pybel.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Applying a Rule of Fives filter
In an influential paper, Lipinski et al. [22] performed an analysis of drug compounds that reached Phase II clinical trials and found that they tended to occupy a certain range of values for molecular weight, LogP, and number of hydrogen bond donors and acceptors. Based on this, they proposed a rule with four criteria to identify molecules that might have poor absorption or permeation properties. This is the Lipinski Rule of Fives, so-called as the numbers involved are all multiples of five. The following example shows how to filter a database to identify only those molecules that pass all four of the Lipinski criteria. The values of the Lipinski descriptors are also added to the output file as data fields. Note that whereas molecular weight is directly available as an attribute of a Molecule, and LogP is available as one of the three group contribution descriptors calculated by OpenBabel, we need to use SMARTS pattern matching to identify the number of hydrogen bond donors and acceptors. The SMARTS patterns used here correspond to the definitions of hydrogen bond donor and acceptor used by Lipinski:

    import pybel

    HBD = pybel.Smarts("[#7,#8;!H0]")
    HBA = pybel.Smarts("[#7,#8]")

    def lipinski(mol):
        """Return the values of the Lipinski descriptors."""
        desc = {'molwt': mol.molwt,
                'HBD': len(HBD.findall(mol)),
                'HBA': len(HBA.findall(mol)),
                'LogP': mol.calcdesc(['LogP'])['LogP']}
        return desc

    passes_all_rules = lambda desc: (desc['molwt'] <= 500 and
                                     desc['HBD'] <= 5 and desc['HBA'] <= 10 and
                                     desc['LogP'] <= 5)

    if __name__=="__main__":
        output = pybel.Outputfile("sdf", "passLipinski.sdf")
        for mol in pybel.readfile("sdf", "inputfile.sdf"):
            descriptors = lipinski(mol)
            if passes_all_rules(descriptors):
                mol.data.update(descriptors)
                output.write(mol)
        output.close()

Future work
The future development of Pybel is closely linked to any changes and improvements to OpenBabel. With each new release of the OpenBabel API, the SWIG bindings will be updated to include any additional functionality. However, additions to the Pybel API will only occur if they simplify access to new features of the OpenBabel toolkit of general use to cheminformaticians. In general, the Pybel API can be considered stable, and an effort will be made to ensure that future changes will be backwards compatible.

Conclusion
Pybel provides a high-level Python interface to the widely-used OpenBabel C++ toolkit. This combination of a high performance cheminformatics toolkit and an expressive scripting language makes it easy for cheminformaticians to rapidly and efficiently write scripts to manipulate molecular data.

Pybel is freely available from the OpenBabel web site [2] both as part of the OpenBabel source distribution and for Windows as an executable installer. Compiled versions are also available as packages in some Linux distributions (openbabel-python in Fedora, for example).

Availability and Requirements
Project name: Pybel
Project home page: http://openbabel.sf.net/wiki/Python
Operating system(s): Platform independent
Programming language: Python
Other requirements: OpenBabel
License: GNU GPL
Any restrictions to use by non-academics: None

Authors' contributions
GRH is the lead developer of OpenBabel and created the SWIG bindings. NMOB developed Pybel, and extended the SWIG interface file. CM compiled the SWIG bindings on Windows and added convenience functions to the OpenBabel API to facilitate access from scripting languages. All authors read and approved the final manuscript.

Additional material
Additional file 1: Pybel API. The HTML documentation of the Pybel API (application programming interface). [http://www.biomedcentral.com/content/supplementary/1752-153X-2-5-S1.zip]

Acknowledgements
The idea for the Pybel module was inspired by Andrew Dalke's work on PyDaylight [11]. We thank the anonymous reviewers for their helpful comments.

References
1. Ousterhout JK: Scripting: Higher Level Programming for the 21st Century. [http://home.pacbell.net/ouster/scripting.html]
2. OpenBabel v.2.1.1 [http://openbabel.sf.net]
3. SMARTS – A Language for Describing Molecular Patterns [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html]
4. Flower DR: On the properties of bit string-based measures of chemical similarity. J Chem Inf Comput Sci 1998, 38:379-386.
5. Wildman SA, Crippen GM: Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 1999, 39:868-873.
6. Ertl P, Rohde B, Selzer P: Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J Med Chem 2000, 43:3714-3717.
7. Python [http://www.python.org]
8. OEChem: OpenEye Scientific Software: Santa Fe, NM.
9. RDKit [http://www.rdkit.org]
10. Daylight Toolkit: Daylight Chemical Information Systems, Inc.: Aliso Viejo, CA.
11. PyDaylight: Dalke Scientific Software, LLC: Santa Fe, NM.
12. Cambios Molecular Toolkit: Cambios Computing, LLC: Palo Alto, CA.
13. Frowns [http://frowns.sf.net]
14. PyBabel in MGLTools [http://mgltools.scripps.edu]
15. Babel v.1.6 [http://smog.com/chem/babel/]
16. SWIG v.1.3.31 [http://www.swig.org]
17. Boost.Python [http://www.boost.org/libs/python/doc/]
18. SIP – A Tool for Generating Python Bindings for C and C++ Libraries [http://www.riverbankcomputing.co.uk/sip/]
19. coverage.py [http://nedbatchelder.com/code/modules/coverage.html]
20. OpenBabel Python [http://openbabel.sourceforge.net/wiki/Python]
21. Jaccard P: La distribution de la flore dans la zone alpine. Rev Gen Sci Pures Appl 1907, 18:961-967.
22. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Del Rev 1997, 23:3-25.
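As a supplementary sketch of the Tanimoto coefficient [21] used in the similarity-selection example of this paper, the following pure-Python function computes it for two fingerprints represented as sets of on-bit indices. The function name tanimoto and the set-based representation are assumptions made for illustration only; they are not part of the Pybel API, which exposes the same quantity through the | operator on Fingerprint objects.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices.

    Hypothetical helper for illustration; not a Pybel or OpenBabel function.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty: return 0.0 by convention in this sketch
        return 0.0
    # Number of shared on-bits divided by number of bits set in either
    return len(a & b) / float(union)


if __name__ == "__main__":
    target = {1, 5, 9, 12}       # made-up bit positions
    candidate = {1, 5, 9, 40}    # shares three of five distinct bits
    similarity = tanimoto(target, candidate)
    print(similarity)
    print(similarity >= 0.7)     # the threshold used in the example above
```

With these made-up bit sets the two fingerprints share three of five distinct bits, so the candidate would fall below the 0.7 threshold and would not be written to the output file.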
Chemistry Central Journal
Software                                                                Open Access

Cinfony – combining Open Source cheminformatics toolkits behind a common interface

Noel M O'Boyle*1 and Geoffrey R Hutchison2

Address: 1Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK and 2Department of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA

Email: Noel M O'Boyle* - oboyle@ccdc.cam.ac.uk; Geoffrey R Hutchison - geoffh@pitt.edu
* Corresponding author

Published: 3 December 2008
Received: 9 October 2008
Accepted: 3 December 2008

Chemistry Central Journal 2008, 2:24 doi:10.1186/1752-153X-2-24
This article is available from: http://journal.chemistrycentral.com/content/2/1/24
© 2008 O'Boyle et al

Abstract

Background: Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit share the same core functionality but support different sets of file formats and forcefields, and calculate different fingerprints and descriptors. Despite their complementary features, using these toolkits in the same program is difficult as they are implemented in different languages (C++ versus Java), have different underlying chemical models and have different application programming interfaces (APIs).

Results: We describe Cinfony, a Python module that presents a common interface to all three of these toolkits, allowing the user to easily combine methods and results from any of the toolkits. In general, the run time of the Cinfony modules is almost as fast as accessing the underlying toolkits directly from C++ or Java, but Cinfony makes it much easier to carry out common tasks in cheminformatics such as reading file formats and calculating descriptors.

Conclusion: By providing a simplified interface and improving interoperability, Cinfony makes it easy to combine complementary features of OpenBabel, the CDK and the RDKit.
Background
Cheminformatics toolkits are essential to the day-to-day work of the practising cheminformatician. They enable the user to deal with such tasks as handling different chemistry file formats, substructure searching, calculation of molecular fingerprints, and structure diagram generation. The main Open Source cheminformatics libraries under active development are OpenBabel [1], the Chemistry Development Kit (CDK) [2], and the RDKit [3]. OpenBabel is a C++ toolkit with bindings in Perl, Python, Ruby and Java, the CDK is a Java toolkit, while the RDKit is another C++ toolkit with Python bindings. While the CDK has its origins in academia, both OpenBabel and the RDKit originated in companies (OpenEye and Rational Discovery, respectively) and have subsequently been developed by the community under Open Source licenses.

In general, all of these toolkits share the same core functionality although the implementation details and underlying chemical model may differ. However, as a result of their independent development and history, each has functionality specific to itself and each toolkit supports different sets of file formats and forcefields, and can calculate different molecular fingerprints and molecular descriptors (Table 1). Despite the diversity of these toolkits and the potential benefits in being able to access all of them at the same time, there has been little work on interoperability between them. This has resulted in a balkanization of this field such that users of one toolkit rarely use another toolkit.

One way to achieve interoperability of chemical toolkits is through the use of standard file formats for exchange of
data. For example, the CML project has defined a standardised XML format for chemical data [4], with successive releases refining and extending the original standard. The OpenSMILES effort [5] has attempted to resolve ambiguities in the published SMILES definition [6] to create a standard. While these efforts deserve support, they face inevitable problems achieving consensus and they require changes to existing software to support the standard. The large number of chemical file formats supported by OpenBabel (currently over 80) illustrates both the potential of achieving a standard as well as the difficulties.

Table 1: Some features of toolkits which are not shared by all three toolkits.

    CDK
      A large number of descriptors (some overlap with RDKit)
      Pharmacophore searching (like RDKit*)
      Calculation of maximum common substructure
      2D structure layout (like RDKit) and depiction
      MACCS keys (also RDKit) and E-State fingerprints
      Integration with the R statistical programming environment
      Support for mass-spectrometry analysis (representations for cleavage reactions, structure generation from formulae)
      Fragmentation schemes (ring fragments, Murcko)
      3D structure generation using a template and heuristics (like OpenBabel)
      3D similarity using ultrafast shape descriptors
      Gasteiger π charge calculation

    OpenBabel
      Not just focused on cheminformatics
      Supports a very large number of chemical file formats including quantum mechanics file formats, molecular mechanics trajectories, 2D sketchers
      3D structure generation using a template method (like CDK)
      Included in all major Linux distributions
      Bindings available from several scripting languages apart from Python, as well as the Java and .NET platforms
      Conformation generation and searching
      InChI (also CDK) and InChIKey generation
      Support for crystallographic space groups
      Several forcefield implementations: UFF (also RDKit), MMFF94, MMFF94s, Ghemical
      Ability to add custom data types to atoms, bonds, residues, molecules

    RDKit
      A large number of descriptors (some overlap with CDK)
      Fragmentation using RECAP rules
      2D coordinate generation (like CDK) and depiction
      3D coordinate generation using geometry embedding
      Calculation of Cahn-Ingold-Prelog stereochemistry codes (R/S)
      Pharmacophore searching (like CDK)
      Calculation of shape similarity (based on volume overlap)
      Chemical reaction handling and transforms
      Atom pairs and topological torsions fingerprints
      Feature maps and feature-map vectors
      Machine-learning algorithms

    * Where the term "like" is used, it indicates that the implementation details differ.

An alternative is interoperability at the API (application programming interface) level. This has the advantage that it does not require any changes to existing software. However, there are at least three barriers to overcome: the need for a programming language that can access all the toolkits simultaneously, the difficulty of exchanging chemical models between different toolkits, and differences in the API for core cheminformatics tasks shared by the toolkits.

Here we describe Cinfony, a Python module that overcomes these barriers to provide interoperability at the API level. Cinfony allows access to OpenBabel, the CDK, and the RDKit through a common interface, and uses a simple yet robust method to pass chemical models between toolkits. Pybel, one of the components of Cinfony, has been described previously [7]. It provides access to OpenBabel from standard Python. In this work, we show that the API developed for Pybel may be considered a generic API for accessing any cheminformatics toolkit. We describe the design and implementation of the Cinfony API for OpenBabel, the RDKit and the CDK. Next, we show how Cinfony simplifies the process of accessing the toolkits and how it can be used in practice to combine the power of the three Open Source toolkits. Finally, we discuss performance and some results from comparisons of the toolkits.

Implementation

Common Application Programming Interface
Cinfony presents the same interface to three cheminformatics toolkits, OpenBabel, the CDK and the RDKit. These are available through three separate modules: obabel, cdk and rdkit. The API is designed to make it easy to carry out many of the common tasks in cheminformatics, and covers the core functionality shared by all of the toolkits. Table 2 gives an overview of the API. The complete API is available here (see Additional file 1).

Table 2: An overview of the Cinfony API.

    Class name      Purpose
    Molecule        Wraps a molecule instance of the underlying toolkit and provides access to methods that act on molecules
    Atom            Wraps an atom instance of the underlying toolkit
    MoleculeData    Provides dictionary-like access to the information contained in the tag fields in SDF and MOL2 files
    Outputfile      Handles multimolecule output file formats
    Smarts          Wraps the SMARTS functionality of the toolkit in an analogous way to the Python 're' module for regular expression matching
    Fingerprint     Simplifies Tanimoto calculation of binary fingerprints

    Function name   Purpose
    readfile        Return an iterator over Molecules in a file
    readstring      Return a Molecule

    Variable name   Purpose
    descs           A list of descriptor IDs
    forcefields     A list of forcefield IDs
    fps             A list of fingerprint IDs
    informats       A list of input format IDs
    outformats      A list of output format IDs

The main class containing chemical information is the Molecule class. Rather than create a new chemical model, the Molecule class is a light wrapper around the molecule object in the underlying library, for example, around OBMol in the case of OpenBabel. Attribute values such as the molecular weight are calculated dynamically by querying the underlying molecule. This ensures that if the underlying OBMol, for example, is altered, the attribute values returned will still be correct. The actual underlying object (an OpenBabel OBMol, a CDK Molecule, or an RDKit Mol) can be accessed directly at any point.

The Molecule class also contains several methods that act on molecules such as methods for calculating fingerprints, adding hydrogens, and calculating descriptor values. This makes it easy to access these methods, and also brings them to the attention of the user. In the underlying toolkit these methods may not be present as part of the molecule class, and in fact, they can be difficult to find in the toolkit's API. For example, the Cinfony method Molecule.addh() adds explicit hydrogens to the molecule. Although the OBMol of OpenBabel has a corresponding method, OBMol.AddHydrogens(), the RDKit uses a global method, AddHs(Mol), while the CDK requires the user to instantiate a HydrogenAdder object, which can then be used to add hydrogens.

The Molecule methods described in the original Pybel API [7] have been extended to handle hydrogen addition and removal, structure diagram generation, assignment of 3D geometry to 0D structures and geometry optimisation using forcefields. Both the CDK and the RDKit are capable of 2D coordinate generation and 2D depiction. However, since OpenBabel currently has neither of these capabilities, a fourth toolkit, OASA, is used by Pybel for this purpose. OASA is a lightweight cheminformatics toolkit implemented in Python [8].

A new development in the latest version of OpenBabel is 3D coordinate generation and geometry optimisation using one of a number of forcefields. Since these methods are also available in the RDKit, and are under development in the CDK, two additional methods have been added to the Cinfony Molecule: make3D(), for 3D coordinate generation, and localopt(), for geometry optimisation. Particularly in the case of OpenBabel, these new methods simplify the process of generating 3D coordinates. Compare a single call to make3D() in Cinfony with the following OpenBabel code:

    structuregenerator = openbabel.OBOp.FindType('Gen3D')
    structuregenerator.Do(mol)
    mol.AddHydrogens()
    ff = openbabel.OBForceField.FindType("MMFF94")
    ff.Setup(mol)
    ff.SteepestDescent(50)
    ff.GetCoordinates(mol)

The Cinfony API is identical for all of the toolkits. However, the values returned by particular API calls are not necessarily standardised across toolkits. This Cinfony design decision is in agreement with the Principle of Least Surprise [9]; when the user accesses the underlying toolkit directly, they will get the same result as found when using Cinfony. This design decision places the responsibility on the user to become familiar with differences in how the toolkits behave. For example, all of the toolkits allow the calculation of path-based fingerprints. These encode all paths in the molecular graph up to a path length of P into a binary vector of length V, but the default values for V and P are different for each toolkit: 1024 and 7 for OpenBabel, 1024 and 8 for the CDK, and 2048 and 7 for RDKit. Although it is possible to alter these parameters for the CDK and the RDKit and so standardise V and P to 1024 and 7 for all of the toolkits, it is reasonable to assume that the developers of each package have chosen sensible defaults. In addition, the implementation details of each of the fingerprinters would still be different; for example, the RDKit sets four bits when hashing each molecular path, the others set one; OpenBabel does not set any bits for the one-atom fragments, N, C and O.

Interoperability
The ability to transfer chemical models between toolkits is essential to the goal of interoperability. However, the internal representation of a molecule is specific to a particular toolkit. For example, as well as the connection table and coordinates (if present), it may include derived data relating to aromaticity, the number of implicit hydrogens on an atom, or stereochemical configuration. Fortunately, the problem of transfer and storage of chemical information has already been solved by the development of molecular file formats, of which over 80 are now supported by OpenBabel. Specifically, the MDL MOL file format [10] and the SMILES format [5,6] are shared by all three toolkits, and are used by Cinfony to exchange information on molecules with 2D or 3D coordinates (MOL file format), and no coordinates (SMILES format), respectively.

By using existing file formats rather than trying to interconvert the internal models themselves, Cinfony takes advantage of the existing input/output code of each toolkit which is well-tested and mature. In addition, the translation process is transparent to the user. However, the user should be aware of known limitations of particular readers or writers. For example, the SMILES parser in CDK 1.0.3 ignores atom-based stereochemistry and thus that information is lost if a 0D rdkit or obabel Molecule with atom-based stereochemistry is converted to a cdk Molecule.

Cinfony Molecules are interconverted using the Molecule() constructor. For example, if obabelmol is an obabel Molecule, then the corresponding rdkit Molecule can be constructed using rdkit.Molecule(obabelmol). This mechanism can also be used to interface Cinfony to other cheminformatics toolkits. The only requirements are that the object passed to the Molecule() constructor needs to have a _cinfony attribute set to True, and an _exchange attribute containing a tuple (0, SMILES string) or (1, MOL file) depending on whether the molecule is 0D or not.

Implementation
The Python scripting language has two main implementations. The most widely used implementation is the original reference implementation of Python in C, referred to as CPython when necessary to distinguish it from other implementations. The next most widely used implementation is Jython, an implementation of Python in Java. Although most users of Python do so through CPython, Jython scripts have the advantage of being able to access Java libraries natively. They can also be compiled into Java classes to be used from Java programs. Jython scripts are also useful in contexts where Java is required but it is more convenient to work in Python; for example, to implement a Java web servlet or a node in a Java workflow environment such as KNIME [11].

As discussed earlier, one of the barriers to interoperability is the requirement for a programming language that can simultaneously access more than one of the toolkits. From CPython it is possible to use Cinfony modules to connect to OpenBabel (pybel), the CDK (cdkjpype) and the RDKit (rdkit). From Jython, there are modules for OpenBabel (jybel) and the CDK (cdkjython). Convenience modules obabel and cdk are provided that automatically import the appropriate OpenBabel or CDK module depending on the Python implementation. The relationship between these Cinfony modules and the underlying cheminformatics libraries is summarised in Figure 1.

pybel and jybel
OpenBabel provides SWIG [12] bindings for both CPython and Java (among other languages). pybel is a wrapper around the CPython bindings, and has previously been described in detail [7]. jybel is an implementation of the Cinfony API that allows the user to access OpenBabel from Jython using the Java bindings. Despite the fact that
jybel is used from a Java implementation of Python, and accesses a C++ library through the Java Native Interface (JNI), the jybel code differs from pybel in very few respects. In Jython, it is not possible to iterate directly over the wrapped STL vectors used by OpenBabel as their Java SWIG bindings do not implement the Iterable interface. Also, the current Jython implementation is 2.2 and does not support generator expressions, which were introduced in Python 2.4. Although both C++ and Python have the concept of a global function or variable, this is not the case in Java. SWIG places such functions, and get/set methods for accessing the variables, in a special class named openbabel. Global constants are placed in another class called openbabelConstants. A convenience module, obabel, is provided which automatically imports the appropriate module depending on the Python implementation.

Figure 1: Relationship of Cinfony modules to Open Source toolkits. Python modules are accessible from CPython (green), Jython (pale blue), or both (striped green and pale blue). Java libraries are indicated by dark blue, while C++ libraries are yellow.

cdkjpype and cdkjython

Since Jython runs on top of the Java Virtual Machine (JVM), it can access Java libraries such as the CDK natively. To access Java libraries from CPython, the Python library JPype [13] is needed. This starts an instance of the JVM and uses the JNI to communicate back and forth. Overall, the differences between the two wrappers are minor. Jython and JPype differ in the syntax used to handle Java exceptions. Also, JPype returns unicode strings from the CDK and these need to be converted to regular strings (otherwise problems arise if they are passed to an OpenBabel method expecting a std::string). The appropriate CDK wrapper, cdkjpype or cdkjython, will be imported if the user imports the convenience module cdk.

rdkit

Support for Python scripting has been part of the design of the RDKit from the start. The Python bindings in RDKit were created using Boost.Python [14], a framework for interfacing Python and C++. The Cinfony module rdkit uses these bindings to implement its API. It is currently not possible to access RDKit from Jython. RDKit has only preliminary support for Java bindings; when these are complete, a corresponding module will be added to Cinfony.

Dependency handling

A fully-featured installation of Cinfony relies on a large number of open source libraries. In particular, the 2D depiction capabilities introduce dependencies on several graphics libraries which may be problematic to install on a particular platform (Cairo and its Python bindings, Python Imaging Library, AGG and the Python wrapper AggDraw). With this in mind, Cinfony treats all dependencies as optional and only raises an Exception if the user calls a method or imports a module that requires a missing dependency.

For example, the Python Imaging Library (PIL) is required for displaying a 2D depiction on the screen. If all of the components of Cinfony are installed except for PIL, Cinfony works perfectly except that an Exception is raised if the Molecule.draw() method is called with show = True (the default). The image can however be written to a file without problems (show = False, filename = "image.png"). Similarly, if a user is only interested in using the CDK and the RDKit, it is not necessary to install OpenBabel.

Full installation instructions for Windows, MacOSX and Linux are available from the Cinfony website. It should be noted that for Windows users, there is no need to compile or search for missing libraries as the dependencies are included as binaries in the Cinfony distribution.

Results

Cinfony API

The original Pybel API was designed to make it easy to use OpenBabel to perform the most common tasks in cheminformatics and to do so using idiomatic Python. Subsequently, we realised that the resulting API could be considered a generic API for wrapping the core functionality of any cheminformatics toolkit. Cinfony implements an extended version of the original Pybel API for the CDK and the RDKit, as well as OpenBabel. While the original Pybel was restricted to CPython, Cinfony can also be used from Jython to access the CDK and OpenBabel.

Cinfony helps cheminformaticians avoid the steep learning curve associated with starting to use a new toolkit.
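The convenience modules obabel and cdk are described as importing the wrapper that matches the Python implementation in use. A minimal sketch of that selection follows, written for Python 3 rather than the Python 2 of the paper. The names wrapper_for and WRAPPERS are hypothetical, and testing sys.platform is an assumption about how such a choice could be made, not a description of the Cinfony source. Only the wrapper module names come from the text.

```python
import sys

# Wrapper module names as given in the text: (CPython wrapper, Jython wrapper).
WRAPPERS = {
    "obabel": ("pybel", "jybel"),
    "cdk": ("cdkjpype", "cdkjython"),
}


def wrapper_for(toolkit, platform=None):
    """Name the wrapper module a convenience module would import.

    platform defaults to sys.platform, which starts with "java" under Jython.
    """
    if platform is None:
        platform = sys.platform
    cpython_name, jython_name = WRAPPERS[toolkit]
    return jython_name if platform.startswith("java") else cpython_name


# Passing the platform explicitly makes both branches easy to try out.
print(wrapper_for("obabel", platform="linux"))       # pybel
print(wrapper_for("cdk", platform="java1.6.0_20"))   # cdkjython
```

Keeping the choice in one small function means user scripts can import the convenience module and stay unchanged when moved between CPython and Jython.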
With Cinfony, all of the core functionality of the toolkits can be accessed with the same interface. For example, in Cinfony, a molecule can be created from a SMILES string with:

    mol = toolkit.readstring("smi", SMILESstring)

RDKit

    mol = Chem.MolFromSmiles(SMILESstring)

OpenBabel

    mol = openbabel.OBMol()
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("smi")
    obconversion.ReadString(mol, SMILESstring)

CDK

    builder = cdk.DefaultChemObjectBuilder.getInstance()
    sp = cdk.smiles.SmilesParser(builder)
    mol = sp.parseSmiles(SMILESstring)

The RDKit was designed with Python scripting in mind, and of the three toolkits is the most concise. On the other hand, OpenBabel uses a characteristically C++ approach. An empty molecule is created, and is passed to an OBConversion instance as a container for the molecule read from the SMILES string. The SmilesParser in the CDK requires an instance of an object implementing the IChemObjectBuilder interface.

Another advantage of a common API is that a script written for one toolkit can easily be modified to use another. As an example, here is a script that selects molecules that are similar to a particular target molecule. This script is taken from the original Pybel paper [7], but uses the CDK instead of OpenBabel and will run equally well from Jython and CPython. The only differences compared to the original script are that "pybel" has been replaced with "cdk", and the import statement has been changed from "import pybel":

    from cinfony import cdk

    targetmol = cdk.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = cdk.Outputfile("sdf", "similarmols.sdf")
    for mol in cdk.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        # "|" between two fingerprints gives their similarity
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Alternatively, we could just have made a single change to the original script, by replacing the import statement from "import pybel" with "from cinfony import cdk as pybel".

Using Cinfony to combine toolkits

Another goal of Cinfony is to make it easy to combine toolkits in the same script. This allows the user to exploit the complementary capabilities of different toolkits (Table 1). For example, let's suppose the user wants to (1) convert a SMILES string to 3D coordinates with OpenBabel, then (2) create a 2D depiction of that molecule with the RDKit, next (3) calculate descriptors with the CDK, and finally (4) write out an SDF file containing the descriptor values and the 3D coordinates. The full Python script is only seven lines long:

    from cinfony import rdkit, cdk, obabel

    mol = obabel.readstring("smi", "CCC=O")
    mol.make3D()
    rdkit.Molecule(mol).draw(show = False, filename = "aldehyde.png")
    descs = cdk.Molecule(mol).calcdesc()
    mol.data.update(descs)
    mol.write("sdf", filename = "aldehyde.sdf")

For cheminformaticians interested in developing QSAR or QSPR models, Cinfony can be used to simultaneously calculate descriptors from the RDKit, the CDK and OpenBabel. For example, the following script reads a multiline input file, with each line consisting of a SMILES string followed by a property value. For each molecule, it calculates all of the OpenBabel, RDKit and CDK descriptors (except for CDK's CPSA) and writes out the results as a tab-separated file suitable for reading with the statistical package R [15]. Note that in this example script, if descriptors share the same name only one is retained. This is the case for the TPSA descriptor in OpenBabel, which is replaced by the RDKit's TPSA descriptor.

    import string
    from cinfony import obabel, cdk, rdkit

    # Read in SMILES strings and observed property values
    smiles, propvals = [], []
    for line in open("data.txt"):
        broken = line.rstrip().split()
        smiles.append(broken[0])
        propvals.append(float(broken[1]))
    mols = [obabel.readstring("smi", smile) for smile in smiles]

    # Calculate descriptor values using OpenBabel,
    # the CDK (apart from 'CPSA') and the RDKit
    cdkdescs = [x for x in cdk.descs if x != 'CPSA']
    descs = []
    for mol in mols:
        d = mol.calcdesc()
        d.update(cdk.Molecule(mol).calcdesc(cdkdescs))
        d.update(rdkit.Molecule(mol).calcdesc())
        descs.append(d)

    # Write a file suitable for 'read.table' in R
    outputfile = open("inputforR.txt", "w")
    descnames = sorted(descs[0].keys(), key = string.lower)
    print >> outputfile, "\t".join(["Property"] + descnames)
    for smile, propval, desc in zip(smiles, propvals, descs):
        descvals = [str(desc[descname]) for descname in descnames]
        print >> outputfile, "\t".join([smile, str(propval)] + descvals)
    outputfile.close()

Performance

Accessing cheminformatics libraries using Cinfony allows the user to rapidly develop scripts that manipulate chemical information. However, there is a small price to be paid. Firstly, there is the cost of moving objects across the interface between Python and the cheminformatics libraries. Secondly, the additional code required by Cinfony to implement a standard API may slow performance further.

To assess the performance penalty for accessing cheminformatics toolkits using Cinfony rather than directly in the native language, we looked at two simple test cases: (1) iterating over an SDF file containing 25419 molecules, (2) iterating and printing out the molecular weight of each of the molecules. The SDF file used was 3_p0.0.sdf, the first portion of the drug-like subset of the ZINC 7.00 dataset [16]. The Cinfony scripts, Java and C++ source code are available as Additional file 2. The results are shown in Table 3.

While accessing the CDK using Jython is almost as fast as a pure Java implementation, there is a considerable overhead associated with using JPype to access the CDK from CPython (89% slower for the second test case). This overhead is due to passing objects between the JVM and CPython. For OpenBabel, there is little performance cost associated with accessing OpenBabel from either implementation of Python, although the jybel scripts are somewhat slower than pybel scripts. A small portion of this speed difference can be attributed to a slower startup (about 1.6 seconds for jybel, compared to 0.8 seconds for pybel). Finally, from the RDKit results in Table 3, it is clear that using Boost.Python to wrap a C++ library is more efficient than using SWIG. The difference in run times between the C++ and Python implementations is negligible.

In practice, the performance of a particular Cinfony script will depend on the extent to which information is passed
back and forth between Python and the underlying Java or C++ library. Where most of the time is spent on computation in the underlying library, the speed difference between a native implementation and one using Cinfony is expected to be small.

Table 3: Performance of Cinfony modules compared to a native Java or C++ implementation.

                     Iterate over SDF           Iterate and calculate molecular weight
                     Time (s)    Normalised     Time (s)    Normalised
    CDK
      Native Java      21.2        1.00           36.8        1.00
      cdkjython        23.1        1.09           41.6        1.13
      cdkjpype         33.0        1.57           69.5        1.89
    OpenBabel
      Native C++       31.9        1.00           43.0        1.00
      pybel            34.1        1.07           45.1        1.05
      jybel            38.0        1.19           49.6        1.15
    RDKit
      Native C++       99.7        1.00          100.7        1.00
      rdkit            99.9        1.00          101.0        1.00

The times reported are wallclock times from the best of three runs on a dual-core Intel Pentium 4 3.2 GHz machine with 1GB RAM.

Comparison of toolkits

Cinfony makes it easy to compare the results obtained by different toolkits for the same operations. This can be useful in identifying bugs, applying a test suite, or finding the strengths and weaknesses of particular implementations. For example, where different toolkits calculate the same descriptors, if the calculated values are not highly correlated it may indicate a bug in one or the other. Earlier, we mentioned that a difference in the treatment of implicit hydrogens causes different toolkits to give different values for molecular weight unless hydrogens are explicitly added. Ensuring that a particular result is in agreement with that obtained by another toolkit can act as a sanity check in such instances to avoid errors.

When carrying out the same operation with several toolkits, it is often convenient to iterate over the toolkits in an outer loop:

    from cinfony import obabel, rdkit, cdk

    for toolkit in [obabel, rdkit, cdk]:
        print toolkit.readstring("smi", "CCC").molwt

As an example of how such comparisons can be used to identify bugs in toolkits, let us consider depiction. As a dataset, we randomly chose 100 molecules from PubChem [17], with subsequent filtering to remove multicomponent molecules. For each molecule, PubChem provides an SDF file containing coordinates for a 2D depiction, as well as the depiction itself as a PNG file. PubChem uses the CACTVS toolkit [18] to generate the 2D coordinates as well as the corresponding depiction. Using a script similar to the following, we used Cinfony to generate 2D depictions using OASA (the depiction library used by pybel), the CDK and a development version of RDKit that all use the same 2D coordinates taken from the SDF file:

    from cinfony import pybel, rdkit

    for toolkit in [rdkit, pybel]:
        name = toolkit.__name__
        for mol in toolkit.readfile("sdf", "dataset.sdf"):
            mol.draw(filename = "%s_%s.png" % (mol.title, name),
                     show = False,
                     usecoords = True)

When the resulting images were compared for the PubChem entry CID7250053, an error was found in the depiction of the stereochemistry of an isopropyl group (Figure 2). Since the error only occurred in certain cases, it had not been previously noticed and would have been difficult to identify without such a comparative study. Once reported, the problem was quickly solved and the subsequent RDKit release depicted the stereochemistry correctly. A comparison of depictions by commercial toolkits
and depictions generated by Cinfony is available here (see Additional file 3).

Figure 2: Comparison of depictions of PubChem CID7250053 using different toolkits. The depiction using the development version of RDKit showed incorrect stereochemistry for the isopropyl substituent of the thiazole ring.

Conclusion

Cinfony makes it easy to combine complementary features of the three main Open Source cheminformatics toolkits. By presenting a standard simplified API, the learning curve associated with starting to use a new toolkit is greatly reduced, thus encouraging users of one toolkit to investigate the potential of others.

Cinfony is freely available from the Cinfony website [19], both as Python source code and as a Windows distribution containing dependencies. Installation instructions are provided for MacOSX, Linux and Windows.

Availability and requirements

Project name: Cinfony
Project home page: http://cinfony.googlecode.com
Operating system(s): Platform independent
Programming language: Python, Jython
Other requirements: OpenBabel, CDK, RDKit, Java, OASA, JPype, Python Imaging Library
License: BSD
Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NMOB conceived and developed Cinfony. GRH is the lead developer of OpenBabel and created the Python and Java SWIG bindings. All authors read and approved the final manuscript.

Additional material

Additional file 1: Miniwebsite API. A mini-website of the Cinfony API documentation. [http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S1.zip]
Additional file 2: Timing Code. A zip file containing Python, Java and C++ code used for run time comparisons for two test cases. [http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S2.zip]
Additional file 3: Miniwebsite Depictions. A mini-website showing a comparison of the depictions generated by several cheminformatics toolkits. [http://www.biomedcentral.com/content/supplementary/1752-153X-2-24-S3.zip]

Acknowledgements

Cinfony would not be possible without the work of many Open Source projects. In particular, we thank several developers who responded quickly to bug reports or queries: Beda Kosata (OASA), Greg Landrum (RDKit), Tim Vandermeersch (OpenBabel), Steve Ménard (JPype). Thanks also to Gilbert Mueller and Chris Morley for feedback on installing Cinfony. NMOB thanks Google Code for providing free web hosting and development tools for Cinfony. We thank the anonymous reviewers for several useful suggestions.

References

1. OpenBabel v.2.2.0 [http://openbabel.org]
2. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics. Curr Pharm Des 2006, 12:2110-2120.
3. Landrum G: RDKit. [http://www.rdkit.org].
4. Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.
5. Apodaca R, O'Boyle N, Dalke A, Van Drie J, Ertl P, Hutchison G, James CA, Landrum G, Morley C, Willighagen E, De Winter H: OpenSMILES. [http://www.opensmiles.org].
6. Daylight Chemical Information Systems Manual [http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html]
7. O'Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
8. Kosata B: OASA. [http://bkchem.zirael.org/oasa_en.html].
9. Raymond ES: The Art of UNIX Programming. Reading, MA: Addison-Wesley; 2003 [http://www.catb.org/~esr/writings/taoup/index.html].
10. Symyx CTfile formats [http://www.mdli.com/downloads/public/ctfile/ctfile.jsp]
11. KNIME – Konstanz Information Miner [http://knime.org]
12. SWIG v.1.3.36 [http://www.swig.org]
13. Ménard S: JPype. [http://jpype.sf.net].
14. Boost.Python [http://www.boost.org/libs/python/doc/]
15. R development core team: R: A language and environment for statistical computing. [http://www.R-project.org].
16. Irwin JJ, Shoichet BK: ZINC – A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Model 2005, 45:177-182.
17. PubChem [http://pubchem.ncbi.nlm.nih.gov/]
18. CACTVS Chemoinformatics Toolkit: Xemistry GmbH: Lahntal, Germany.
19. O'Boyle NM: Cinfony. [http://cinfony.googlecode.com].
  • 25. O’Boyle et al. Journal of Cheminformatics 2011, 3:33 http://www.jcheminf.com/content/3/1/33

SOFTWARE Open Access

Open Babel: An open chemical toolbox

Noel M O’Boyle1, Michael Banck2, Craig A James3, Chris Morley4, Tim Vandermeersch4 and Geoffrey R Hutchison5*

* Correspondence: geoffh@pitt.edu
5 University of Pittsburgh, Department of Chemistry, 219 Parkman Avenue, Pittsburgh, PA 15217, USA
Full list of author information is available at the end of the article

© 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. J. Cheminf. 2011, 3, 33.

Abstract

Background: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats.

Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion.

Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license from http://openbabel.org.

Introduction

The history of chemical informatics has included a huge variety of textual and computer representations of molecular data. Such representations focus on specific atomic or molecular information and may not attempt to store all possible chemical data. For example, line notations like Daylight SMILES [1] do not offer coordinate information, while crystallographic or quantum mechanical formats frequently do not store chemical bonding data. Hydrogen atoms are frequently omitted from x-ray crystallography due to the difficulty in establishing coordinates, and are often ignored by some file formats as the “implicit valence” of heavy atoms that indicates their presence. Other types of representations require specification of atom types on the basis of a specific valence bond model, inclusion of computed partial charges, indication of biomolecular residues, or multiple conformations.

While attempts have been made to provide a standard format for storing chemical data, including most notably the development of Chemical Markup Language (CML) [2-6], an XML dialect, such formats have not yet achieved widespread use. Consequently, a frequent problem in computational modeling is the interconversion of molecular structures between different formats, a process that involves extraction and interpretation of their chemical data and semantics.

We outline for the first time, the development and use of the Open Babel project, a full-featured open chemical toolbox, designed to “speak” the many different representations of chemical data. It allows anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas. It provides both ready-to-use programs as well as a complete, extensible programmer’s toolkit for developing cheminformatics software. It can handle reading,
writing, and interconverting over 110 chemical file formats, supports filtering and searching molecule files using Daylight SMARTS pattern matching [7] and other methods, and provides extensible fingerprinting and molecular mechanics frameworks. We will discuss the frameworks for file format interconversion, fingerprinting, fast molecular searching, bond perception and atom typing, canonical numbering of molecular structures and fragments, molecular mechanics force fields, and the extensible interfaces provided by the software library to enable further chemistry software development.

Open Babel has its origin in a version of OELib released as open-source software by OpenEye Scientific under the GPL (GNU Public License). In 2001, OpenEye decided to rewrite OELib in-house as the proprietary OEChem library, so the existing code from OELib was spun out into the new Open Babel project. Since 2001, Open Babel has been developed and substantially extended as an international collaborative project using an open-source development model [8]. It has over 160,000 downloads, over 400 citations [9], is used by over 40 software projects [10], and is freely available from the Open Babel website [11].

Features

File Format Support

With the release of Open Babel 2.3, Open Babel supports 111 chemical file formats in total. It can read 82 formats and write 85 formats. These encompass common formats used in cheminformatics (SMILES, InChI, MOL, MOL2), input and output files from a variety of computational chemistry packages (GAMESS, Gaussian, MOPAC), crystallographic file formats (CIF, ShelX), reaction formats (MDL RXN), file formats used by molecular dynamics and docking packages (AutoDock, Amber), formats used by 2D drawing packages (ChemDraw), 3D viewers (Chem3D, Molden) and chemical kinetics and thermodynamics (ChemKin, Thermo). Formats are implemented as “plugins” in Open Babel, which makes it easy for users to contribute new file formats (see Extensible Interface below). Depending on the format, other data is extracted by Open Babel in addition to the molecular structure; for example, vibrational frequencies are extracted from computational chemistry log files, unit cell information is extracted from CIF files, and property fields are read from SDF files.

A number of “utility” file formats are also defined; these are not strictly speaking a way of storing the molecular structure, but rather present certain functionality through the same interface as the regular file formats. For example, the report format is a write-only utility format [12] that presents a summary of the molecular structure of a molecule; the fingerprint format [13] and fastsearch format [14] are used for similarity and substructure searching (see below); the MolPrint2D and Multilevel Neighborhoods of Atoms formats calculate circular fingerprints defined by Bender et al. [15,16] and Filimonov et al. [17,18] respectively.

Each format can have multiple options to control either reading or writing a particular format. For example, the InChI format has 12 options including an option “K” to generate an InChIKey, “T <param>” to truncate the InChI depending on a supplied parameter and “w” to ignore certain InChI warnings. The available options are listed in the documentation, are shown in the Graphical User Interface (GUI) as checkboxes or textboxes, and can be listed at the command-line. In fact, all three are generated from the same source; a documentation string in the C++ code.

Fingerprints and Fast Searching

Databases are widely used to store chemical information especially in the pharmaceutical industry. A key requirement of such a database is the ability to index chemical structures so that they can be quickly retrieved given a query substructure. Open Babel provides this functionality using a path-based fingerprint. This fingerprint, referred to as FP2 in Open Babel, identifies all linear and ring substructures in the molecule of lengths 1 to 7 (excluding the 1-atom substructures C and N) and maps them onto a bit-string of length 1024 using a hash function. If a query molecule is a substructure of a target molecule, then all of the bits set in the query molecule will also be set in the target molecule. The fingerprints for two molecules can also be used to calculate structural similarity using the Tanimoto coefficient, the number of bits in common divided by the union of the bits set.

Clearly, repeated searching of the same set of molecules will involve repeated use of the same set of fingerprints. To avoid the need to recalculate the fingerprints for a particular multi-molecule file (such as an SDF file), Open Babel provides a fastindex format that solely stores a fingerprint along with an index into the original file. This index leads to a rapid increase in the speed of searching for matches to a query - datasets with several million molecules are easily searched interactively. In this way, a multi-molecule file may be used as a lightweight alternative to a chemical database system.

Bond Perception and Atom Typing

As mentioned above, many chemical file formats offer representations of molecular data solely as lists of atoms. For example, most quantum chemical software packages and most crystallographic file formats do not offer definitions of bonding. A similar situation occurs in the case of the Protein Data Bank (PDB) format; while standardized [19] files contain connectivity
information, non-standard files exist that often do not provide full connectivity information. Consequently, Open Babel features methods to determine bond connectivity, bond order perception, aromaticity determination, and atom typing.

Bond connectivity is determined by the frequently used algorithm of detecting atoms closer than the sum of their covalent radii, with a slight tolerance (0.45 Å) to allow for longer than typical bonds. To handle disorder in crystallographic data (e.g., PDB or CIF files), atoms closer than 0.63 Å are not bonded. A further filtering pass is made to ensure standard bond valency is maintained; each element has a maximum number of bonds, and if this is exceeded then the longest bonds to an atom are successively removed until the valence rule is fulfilled.

After bond connectivity is determined, if needed or requested by the user, bond order perception is performed on the basis of bond angles and geometries. The method is similar to that proposed by Roger Sayle [20] and uses the average bond angle around an un-typed atom to determine sp and sp2 hybridized centers. 5-membered and 6-membered rings are checked for planarity to estimate aromaticity. Finally, atoms marked as unsaturated are checked for an unsaturated neighbor to give a double or triple bond. After this initial atom typing, known functional groups are matched, followed by aromatic rings, followed by remaining unsatisfied bonds based on a set of heuristics for short bonds, atomic electronegativity, and ring membership.

Atom typing is performed by “lazy evaluation,” matching atoms against SMARTS patterns to determine hybridization, implicit valence, and external atom types. Atom type perception may be triggered by adding hydrogens (which requires determination of implicit and explicit valence), exporting to a file format that requires atom types, or as requested by the user. To minimize the amount of typing required, when importing from a format with atom types specified, a lookup table is used to translate between equivalent types.

An important part of atom typing is aromaticity detection and assignment of Kekulé bond orders (kekulization). In Open Babel, a central aromaticity model is used, largely matching the commonly used Daylight SMILES representation [1], but with added support for aromatic phosphorus and selenium. Potential aromatic atoms and bonds are flagged on the basis of membership in a ring system possibly containing 4n+2 π electrons. Aromaticity is established only if a well-defined valence bond Kekulé pattern can be determined. To do this, atoms are added to a ring system and checked against the 4n+2 π electron configuration, gradually increasing the size to establish the largest possible connected aromatic ring system. Once this ring system is determined, an exhaustive search is performed to assign single and double bonds to satisfy all valences in a Kekulé form. Since this process is exponential in complexity, the algorithm will terminate if more than 30 levels of recursion or 15 seconds are exceeded (which may occur in the case of large fused ring systems such as carbon nanotubes).

Canonical Representation of Molecules
In general, for any particular molecular structure and file format, there are a large number of possible ways the structure could be stored; for example, there are N! ways of ordering the atoms in an MOL file. While each of the orderings encodes exactly the same information, it can be useful to define a canonical numbering of the atoms of a molecule and use this to derive a canonical representation of a molecule for a particular file format. For a zero-dimensional file format without coordinates, such as SMILES, the canonical representation could be used to index a database, remove duplicates or search for matches.

Open Babel implements a sophisticated canonicalization algorithm that can handle molecules or molecular fragments. The atom symmetry classes are the initial graph invariants and encode topological and chemical properties. A cooperative labeling procedure is used to investigate the automorphic permutations to find the canonical code. Although the algorithm is similar to the original Morgan canonical code [21], various improvements are implemented to improve performance. Most notably, the algorithm implements heuristics from the popular nauty package [22,23]. Another aspect handled by the canonical code is stereochemistry as different labelings can lead to different parities. This is further complicated by the possibility of symmetry-equivalent stereocenters and stereocenters whose configuration is interdependent. The full details will be the subject of a separate publication.

Coordinate Generation in 2D and 3D
Open Babel, version 2.3, has support for 2D coordinate generation (Figure 1) through the donation of code by Sergei Trepalin, based on the code used in the MCDL chemical structure editor [24-26]. The MCDL algorithm aims to lay out the molecular structure in 2D such that all bond lengths are equal and all bond angles are close to 120°. The layout algorithm includes a small database of around 150 templates to help lay out cages and large fragment cycles. To deal with the problem of overlapping fragments, the algorithm includes an exhaustive search procedure that rotates around acyclic bonds by 180°.

Coordinate generation in 3D was introduced in Open Babel version 2.2, and improved in version 2.3, to enable
conversion from 0D formats such as SMILES to 3D formats such as SDF (Figure 1). The 3D structure generator builds linear components from scratch following geometrical rules based on the hybridization of the atoms. Single-conformer ring templates are used for ring systems. The template matching algorithm iterates through the templates from largest to smallest searching for matches. If a match is found, the algorithm continues but will not match any ring atoms previously templated except in the case of a single overlap (the two ring systems of a spiro group) or an overlap involving exactly two adjacent atoms (two fused ring systems). After an initial structure is generated, the stereochemistry (cis/trans and tetrahedral) is corrected to match the input structure. Finally, the energy of the structure is minimized using the MMFF94 forcefield [27-31] and a low energy conformer found using a weighted rotor search.

Figure 1 Interconversion of 0D, 2D and 3D structures. The structures shown are of sertraline, a selective serotonin reuptake inhibitor (SSRI) used in the treatment of depression. A SMILES string for sertraline is shown at the top; this can be considered a 0D structure (only connectivity and stereochemical information). From this, Open Babel can generate a 2D structure (bottom left, depicted by Open Babel) or a 3D structure (bottom right, depicted by Avogadro), and all of these can be interconverted.

While the 3D structure builder produces reasonable conformations for molecules without rings or with ring systems for which a template exists, the results may be poor for molecules with more complex ring systems or organometallic species. Future work will be performed to compare the results of Open Babel with other programs with respect to both speed and the quality of the generated structures [32].

Stereochemistry
A recent focus of Open Babel development has been to ensure robust translation of stereochemical information between file formats. This is particularly important when dealing with 0D formats as these explicitly encode the perceived stereochemistry. Open Babel 2.3 includes classes to handle cis/trans double bond stereochemistry, tetrahedral stereochemistry and square-planar stereochemistry (this last is still under development), as well as perception routines for 2D and 3D geometries, and routines to query and alter the stereochemistry.

The detection of stereogenic units starts with an analysis of the graph symmetry of the molecule to identify the symmetry class of each atom. However, given that a complete symmetry analysis also needs to take stereochemistry into account, this means that the overall stereochemistry can only be found iteratively. At each iteration, the current atom symmetry classes are used to identify stereogenic units. For example, a tetrahedral center is identified as chiral if it has four neighbors with different symmetry classes (or three, in the case where a lone pair gives rise to the tetrahedral shape).

Forcefields
Molecular mechanics functions are provided for use with small molecules. Typical applications include energy evaluation or minimization, alone or as part of a larger workflow. The selection of implemented force fields allows most molecular structures to be used and parameters to be assigned automatically. The MMFF94(s) force field can be used for organic or drug-like molecules [27-31]. For molecules containing any element of the periodic table or complex geometry (i.e. not supported by MMFF94), the UFF force field can be used instead [33]. Recently, code implementing the GAFF force field [34,35] was also contributed and released as part of version 2.3. All of the forcefields allow the application of constraints on particular atom positions, or particular distances.

Several conformer searching methods have been implemented using the forcefields, all based on the “torsion-driving” approach. This approach involves setting torsion angles from a set of predefined allowed values for a particular rotatable bond. The most thorough search method implemented is a systematic search method, which iterates over all of the allowed torsion angles for each rotatable bond in the molecule and retains the conformer with the lowest energy. Since a systematic search may not be feasible for a molecule with multiple rotatable bonds, a number of stochastic search methods are also available: the random search method, which tries random settings for the torsion angles (from the predefined allowed values), and a weighted rotor search, a stochastic search method that converges on a low energy conformer by weighting particular torsion angles based on the relative energy of the generated conformer. With Open Babel 2.3, conformer search based on a genetic algorithm is also available which allows the application of filters (e.g. a diversity filter) and different scoring functions. This latter method can be used to generate a library of diverse conformers,
or like the other methods to seek a low energy conformer [36].

Implementation
Technical Details
Open Babel is implemented in standards-compliant C++. This ensures support for a wide variety of C++ compilers (MSVC, GCC, Intel Compiler, MinGW, Clang), operating systems (Windows, Mac OS X, Linux, BSD, Windows/Cygwin) and platforms (32-bit, 64-bit). Since version 2.3, it is compiled using the CMake build system [37,38]. This is an open-source cross-platform build system with advanced features for dependency analysis. The build system has an associated unit test framework CTest, which allows nightly builds to be compiled and tested automatically with the results collated and displayed on a centralized dashboard [39].

To simplify installation Open Babel has as few external dependencies as possible. Where such dependencies exist, they are optional. For example, if the XML development libraries are not available, Open Babel will still compile successfully but none of the XML formats (such as Chemical Markup Language, CML) will be available. Similarly, if the Eigen matrix and linear algebra library is not found, any classes that require fast matrix manipulation (such as OBAlign, which performs least squares alignment) will not be compiled.

While the majority of the Open Babel library is written in C++, bindings have been developed for a range of other programming languages, including Java and the .NET platform, as well as the so-called “dynamic” scripting languages Perl, Python, and Ruby. These are automatically generated from the C++ header files using the SWIG tool. As described previously [40], in the case of Python an additional module is provided named Pybel that simplifies access to the C++ bindings. These interfaces facilitate development of web-enabled chemistry applications, as well as rapid development and prototyping.

Code Architecture
The Open Babel codebase has a modular design as shown in Figure 2. The goal of this design is threefold:

1. To separate the chemistry, the conversion process and the user interfaces reducing, as far as possible, the dependency of one upon another.
2. To put all of the code for each chemical format in one place (usually a single file) and make the addition of new formats simple.
3. To allow the format conversion of not just molecules, but also any other chemical objects, such as reactions.

Figure 2 Architecture of the Open Babel codebase.

The code base can be considered as consisting of the following modules (Figure 2):

• The Chemical Core, which contains OBMol etc. and has all of the chemical structure description and manipulation. This is the heart of the application and its API can be used as a chemical toolbox. It has no input/output capabilities.
• The Formats, which read and write to files of different types. These classes are derived from a common base class, OBFormat, which is in the Conversion Control module. They also make use of the chemical routines in the Chemical Core module. Each format file contains a global object of the format class. When the format is loaded the class constructor registers the presence of the class with OBConversion. This means that the formats are plugins - new formats can be added without changing any framework code.
• Common Formats include OBMoleculeFormat and XMLBaseFormat from which most other formats (like Format A and Format B in the diagram) are derived. Independent formats like Format C are also possible.
• The Conversion Control, which also keeps track of the available formats, the conversion options and the input and output streams. It can be compiled without reference to any other parts of the program. In particular, it knows nothing of the Chemical Core: mol.h is not included.
• The User Interface, which may be a command line application, a Graphical User Interface (GUI), or may be part of another program that uses Open Babel’s input and output facilities. This depends only
on the Conversion Control module (obconversion.h is included), but not on the Chemical Core or on any of the Formats.
• The Fingerprint API, as well as being usable in external programs, is employed by the fastsearch and fingerprint formats.
• The Fingerprints, which are bit arrays that describe an object and which facilitate fast searching. They are also built as plugins, registering themselves with their base class OBFingerprint which is in the Fingerprint API.
• Other features such as Forcefields, Partial Charge Models and Chemical Descriptors, although not shown in the diagram, are handled similarly to Fingerprints.
• The Error Handling can be used throughout the program to log and display errors and warnings.

Extensible Interface
The utility of software libraries such as Open Babel depends on the ability of the design to be extended over time to support new functionality. To facilitate this, Open Babel implements a plugin interface for file formats, fingerprints, charge models, descriptors, “operators” and molecular mechanics force fields. This ensures a clean separation of the implementation of a particular plugin from the core Open Babel library code, and makes it easy for a new plugin (e.g. a new file format) to be contributed; all that is needed is a single C++ file and a trivial change to one of the build files. The operator plugins provide a very general mechanism for operating on a molecule (e.g. energy minimization or 3D coordinate generation) or on a list of molecules (e.g. filtering or sorting) after reading but before writing.

Plugins are dynamically loaded at runtime. This decreases the overall disk and memory footprint of Open Babel, allowing external developers to choose particular functionality needed for their application and ignore other, less relevant features. It also allows the possibility of a third party distributing plugins separately from the Open Babel distribution to provide additional functionality.

Open-Source License and Open Development
Open Babel is open-source software, which offers end users and third-party developers a range of additional rights not granted by proprietary chemistry software. Open-source software, at its most basic level, grants users the rights to study how their software works, to adapt it for any purpose or otherwise modify it, and to share the software and their modifications with others. In this sense, Open Source functions in similar ways to the processes of open peer review, publication, and citation in science. The rights granted by open source licenses largely coincide with the norms of scientific ethics to enable verifiability, repeatability, and building on previous results and theories.

Beyond these rights, Open Babel (like most other open-source projects) offers open development, that is, all development occurs in public forums and with public code repositories. This results in greater input from the community as any user can easily submit bug reports or feature suggestions, get involved in discussions on the future direction of Open Babel or even become a developer him/herself. In practice, the number of active contributors has increased over time through this level of open, public development (Figure 3). Moreover, it means that the development of the code is completely transparent and the quality of the software is available for public scrutiny. Indeed, since its inception, over 658 bugs have been submitted to the public tracker and fixed [41].

Figure 3 Number of contributors over time. Note that this graph only includes developers who directly committed code to the Open Babel source code repository, and does not include patches provided by users.

Validation and Testing
Open Babel includes an extensive test suite comprising 60 different test programs each with tens to hundreds of tests. In early 2010, a nightly build infrastructure and dashboard was put in place with support from Kitware, Inc. This has greatly improved code quality by catching regressions, and also ensures that the code compiles cleanly on all platforms and compilers supported by Open Babel. Some examples of tests that are run each night are:

(1) The MMFF94 forcefield code is tested against the MMFF94 validation suite.
(2) The OBAlign class, which was developed using Test-Driven Development (TDD) methodology, is run against its test suite.
(3) Handling of symmetry is validated by converting several test cases between SMILES, 2D and 3D SDF, and InChI (there are also several test programs with unit tests for the individual stereo classes in the API).
(4) The SMARTS parser is tested using over 250 valid and invalid SMARTS patterns, and the SMARTS matcher is tested using 125 basic SMARTS patterns.
(5) The LSSR (Least Set of Smallest Rings) code is tested for invariance against changing the atom order for a series of polycyclic molecules.

Recently the development team has placed a major focus on increasing the robustness of file format translation particularly in relation to the commonly used SMILES and MDL Molfile formats. Translating between these formats requires accurate stereochemistry perception, inference of implicit hydrogens, and kekulization of delocalized systems. While it is difficult to ensure that any complex piece of code is free of bugs, and Open Babel is no exception, validation procedures can be carried out to assess the current level of performance and to find additional test cases that expose bugs. The following procedure was used to guide the rewriting of stereochemistry code in Open Babel, a project that began in early 2009. Starting with a dataset of 18,084 3D structures from PubChem3D as an SDF file, we compared the result of (a) conversion to SMILES, followed by conversion of that to Canonical SMILES to (b) conversion directly to Canonical SMILES. This procedure can be used to flush out errors in reading the original SDF file, reading/writing SMILES (either due to stereochemistry errors or kekulization problems), and is also a test (to some extent) of the canonicalization code. At the time of starting this work (March 2009), the error rate found was 1424 (8%); by Oct 2009, combined work on stereochemistry, kekulization and canonicalization had reduced this to 190 (~1%), and continued improvements have reduced the number of errors down to two (shown in Figure 4) for Open Babel 2.3.1 (~0.01%). The first failure is due to a kekulization error in a polycyclic aromatic molecule incorporating heteroatoms: (a) gave c1ccc2c(c1)c1[nH][nH]c3c4c1c(c2)ccc4cc1c3cccc1 while (b) gave c1ccc2c(c1)c1nnc3c4c1c(c2)ccc4cc1c3cccc1. This error led to confusion over whether or not the aromatic nitrogens have hydrogens attached (they do not). The second failure involves confusion over the canonical stereochemistry at a bridgehead carbon: (a) gave C1CN2[C@@H](C1)CCC2 while (b) gave C1CN2[C@H](C1)CCC2. This is actually a meso compound and so both SMILES strings are correct and represent the same molecule. However, the canonicalization algorithm should have chosen one stereochemistry or the other for the canonical representation.

Figure 4 The two failures found in the validation test for reading/writing SMILES.

Another area of focus was the canonicalization algorithm, which can be used to generate canonical SMILES as well as other formats. The algorithm can be tested by ensuring that the same canonical SMILES string is obtained even when the order of atoms in a molecule is changed (while retaining the same connection table). The test stresses all areas of the library, including aromaticity perception, kekulization, stereochemistry, and canonicalization. The development of the canonicalization code in Open Babel was guided by applying this test to the 5,151,179 molecules in the eMolecules catalogue (dated 2011-01-02) with 10 random shuffles of the atom order. At the time of the Open Babel 2.2.3 release, there were 24,404 failures of the canonicalization algorithm; this has now been reduced to only four (shown in Figure 5, < 0.001%). The Open Babel nightly test suite ensures that this test passes for a number of problematic molecules. Although the canonicalization algorithm is still not perfect, we believe that the current level of performance (99.99992% success on the eMolecules catalogue) is acceptable for general use and with time we intend to improve performance further.

Given that the error rate for canonicalization and handling of stereochemistry is now quite low, the next area of focus for the Open Babel development team is to improve the handling of implicit valence for “unusual atoms.” This is particularly important for organometallic species and inorganic complexes.

Using Open Babel
Applications
The Open Babel package is composed of a set of user applications as well as a programming library. The main command line application provided is obabel (a small upgrade on the earlier babel), which facilitates file format conversion, filtering (by SMARTS, title, descriptor value, or property field), 3D or 2D structure generation,
conversion of hydrogens from implicit to explicit (and vice versa), and removal of small fragments or of duplicate structures. A number of features are provided to handle multi-molecule file formats (such as SDF or MOL2) and to use or manipulate the information in property fields and molecule titles. Here is an example of using obabel to convert from SDF format to SMILES:

obabel inputmols.sdf -O outputmols.smi

A more complicated use would be to extract all molecules in an SDF file whose titles start with “active”:

obabel inputmols.sdf -aT -o copy -O outputmols.sdf --filter "title='active*'"

The copy format specified by “-o copy” is a utility format that copies the exact contents of the input file (for the filtered molecules) directly to the output, without perception or interpretation. The “-aT” indicates that only the title of the input SDF file should be read; full chemical perception is not required.

Figure 5 The four failures found in the validation test for canonicalization.

The Open Babel graphical user interface (GUI) provides the same functionality. Figure 6 is a screenshot of the GUI carrying out the same filtering operation described in the obabel example above. The left panel deals with setting up the input file, the right panel handles the output and the central panel is for setting conversion options. Depending on whether a particular option requires a parameter, the available options are displayed either as check boxes or as text entry boxes. These interface elements are generated dynamically directly from the text description and help text provided by each format plugin.

Programming Library
The Open Babel library allows users to write chemistry applications without worrying about the low-level details of handling chemical information, such as how to read or write a particular file format, or how to use SMARTS for substructure searching. Instead, the user can focus on the scientific problem at hand, or on creating a more easy-to-use interface (e.g. a GUI) to some of Open Babel’s functionality. The Open Babel API (Application Programming Interface) is the set of classes, methods and variables provided by Open Babel to the user for use in programs. Documentation on the complete API (generated using Doxygen [42]) is available from the Open Babel website [43], or can be generated from the source code.

The functionality provided by the Open Babel library is relied upon by many users and by several other software projects, with the result that introducing changes to the API would cause existing software to break. For this reason, Open Babel strives to maintain API stability over long periods of time, so that existing software will continue to work despite the release of new Open Babel versions with additional features, file formats and bug fixes. Open Babel uses a version numbering system that indicates how the API has changed with every release:

• Bug fix releases (e.g. 2.0.0 versus 2.0.1) do not change API at all
• Minor version releases (e.g. 2.0 versus 2.1) will add to the API, but will otherwise be backwards-compatible
• Major version releases (e.g. 2 versus 3) are not backwards-compatible, and have changes to the API (including removal of deprecated classes and functions)

Figure 7 shows an example C++ program that uses the two main classes OBConversion and OBMol to print out the molecular weight of all of the molecules in an SDF file. This could be used, for example, to investigate differences in the molecular weight distribution between two databases. The same program is shown in Figure 8 but implemented using the Python bindings.

Examples of Use
Open Babel has already been referenced over 400 times for various uses. The most common use of Open Babel is through the obabel command line application (or the corresponding graphical user interface) for the interconversion of chemical file formats. Such conversions may also involve the calculation or inference of additional molecular information or application of a filter. Some
published examples of these include the following:

• interconversion of chemical file formats or representations [44-47]
• addition of hydrogens [48-50]
• generation of 3D molecular structures [51-53]
• calculation of partial charges [54,55]
• generation of molecular fingerprints [56-59]
• removal of duplicate molecules from a dataset [60]
• calculation of MOL2 atom types [61]

Figure 6 Screenshot of the Open Babel GUI. In the screenshot, the Open Babel GUI is running on Bio-Linux 6.0, an Ubuntu derivative.

An interesting example that shows how a particular chemical representation may be used to facilitate a scientific study is the crystallographic study of Fábián and Brock who used Open Babel to generate InChI strings for molecules in the Cambridge Structural Database [62]. Exploiting the fact that InChIs of enantiomers are identical except at the enantiomer sublayer (“/m0”

Figure 7 Example C++ program that uses the Open Babel library. The program prints out the molecular weight of each molecule in the SDF file “dataset.sdf”.

Figure 8 Example Python program that uses the Open Babel library. The program prints out the molecular weight of each molecule in the SDF file “dataset.sdf”.
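As an illustration of the kind of program described in the caption of Figure 8, a minimal sketch using the Pybel module might look like the following. It is not the listing from the figure. The function names molecular_weights and print_molecular_weights are hypothetical, and the sketch assumes that Pybel is importable either as "pybel" (the Open Babel 2.x layout) or as "openbabel.pybel" (the layout used by later releases).

```python
def molecular_weights(sdf_path):
    """Yield (title, molecular weight) for each molecule in an SDF file.

    Hypothetical helper written for illustration only.
    """
    # Import lazily so that this module can be loaded even when
    # Open Babel is not installed; the import only runs on first use.
    try:
        from openbabel import pybel  # assumed layout for later releases
    except ImportError:
        import pybel  # assumed layout for Open Babel 2.x
    # pybel.readfile iterates over the molecules in a multi-molecule file.
    for mol in pybel.readfile("sdf", sdf_path):
        yield mol.title, mol.molwt


def print_molecular_weights(sdf_path="dataset.sdf"):
    """Print one line per molecule: its title and molecular weight."""
    for title, weight in molecular_weights(sdf_path):
        print(f"{title}\t{weight:.3f}")
```

Calling print_molecular_weights("dataset.sdf") would then print one line per molecule. Collecting the values from two files instead of printing them is one way the molecular weight distributions of two databases could be compared.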
Table 1 Software applications and libraries that use Open Babel

Name | Description | Reference | Web page
Avogadro | GUI for molecular modelling and computational chemistry | G. Hutchison, M. Hanwell | http://avogadro.openmolecules.net/
cclib | Parse computational chemistry output files | [72] | http://cclib.sf.net/
CCP1GUI | GUI for computational chemistry | Jens Thomas | http://www.cse.scitech.ac.uk/ccg/software/ccp1gui
ChemAzTech | Manage a chemical laboratory database | Rémy Dernat | http://chemaztech.sf.net/
ChemSpotlight | Chemistry file indexer for MacOSX | G. Hutchison | http://chemspotlight.openmolecules.net/
ChemT | GUI for generating combinatorial libraries | Rui Abreu | http://www.esa.ipb.pt/~ruiabreu/chemt
ChemTool | 2D molecular drawing | [73] | http://ruby.chemie.uni-freiburg.de/~martin/chemtool
CMDF | Library for handling and preparing multi-scale multi-paradigm simulations | [74] | http://web.mit.edu/mbuehler/www/research/CMDF/CMDF.htm
Confab | Systematically generate conformers | [36] | http://confab.googlecode.com/
DockoMatic | Automate the preparation and analysis of AutoDock runs | [75] | http://sf.net/projects/dockomatic/
DOVIS 2.0 | Automate the preparation and analysis of AutoDock runs | [76] | http://www.bhsai.org/dovis.html
FAF-Drugs2 | ADMET filtering of molecular datasets | [77] | http://www.mti.univ-paris-diderot.fr/fr/downloads.html
FMiner2 | Large-scale chemical graph mining based on backbone refinement classes | [78,79] | http://www.maunz.de/wordpress/bbrc
Ghemical | GUI for computational chemistry | Tommi Hassinen | http://www.uku.fi/~thassine/projects/ghemical
Gnome Chemistry Utils | 2D chemical editor, 3D viewer, chemical calculator and periodic table for Linux | Jean Bréfort | http://gchemutils.nongnu.org/
iBabel | MacOSX interface to Open Babel and other Open chemistry tools | Chris Swain | http://homepage.mac.com/swain/Sites/Macinchem/page65/ibabel3.html
Kalzium | GUI showing information on the periodic table of the elements | Carsten Niehaus | http://edu.kde.org/kalzium/
Lazar | Lazy Structure-Activity Relationships for toxicity prediction | [80] | http://www.in-silico.de/software/
Molekel | GUI for computational chemistry | Ugo Varetto | http://molekel.cscs.ch/
molsKetch | 2D chemical editor | Harm van Eersel | http://molsketch.sf.net/
MyChem | Chemistry extension to the MySQL database | J. Pansanel | http://mychem.sf.net/
NanoEngineer-1 | Computer-aided design for the nanoscale | Nanorex, Inc. | http://nanoengineer-1.net/
NanoHive-1 | Simulator for the study, experimentation, and development of nanotech entities | Brian Helfrich | http://www.nanohive-1.org/
OpenMD | Open Source molecular dynamics engine | [81] | http://openmd.net/
Open3DQSAR | High-throughput chemometric analysis of molecular interaction fields | [82,83] | http://www.open3dqsar.org/
OSRA | Extracts chemical structures from images | [84] | http://osra.sf.net/
PgChem | Chemistry extension to the PostgreSQL database | Ernst-Georg Schmidt | http://pgfoundry.org/projects/pgchem
Pharao | Pharmacophore discovery and searching | Silicos NV | http://www.silicos.be/
Pharmer | Pharmacophore searching | [85] | http://smoothdock.ccbb.pitt.edu/pharmer
Piramid | Shape-based alignment of molecules | Silicos NV | http://www.silicos.be/
PyADF | Library for handling and preparing quantum mechanical multi-scale simulations | [86] | http://www.ipc.kit.edu/cfn-ysg/158.php
PyRx | GUI for virtual screening with protein-ligand docking | Sargis Dallakyan | http://pyrx.scripps.edu/
QMForge | GUI for analysing results of quantum chemistry calculations | [72] | http://qmforge.sf.net/
RMG | Reaction Mechanism Generator | [87] | http://rmg.sf.net/
Sci3D | Interactive visualization of 3D models of scientific data, such as molecular structures and surfaces | T.J. O’Donnell | http://sci3d.sf.net/
Sieve | Filter molecules from datasets | Silicos NV | http://www.silicos.be/
SMIREP | Generation of fragment-based structure-activity relationships | [88] | http://www.karwath.org/systems/smirep.html
Stripper | Extract molecular scaffolds | Silicos NV | http://www.silicos.be/
  • 35. Table 1 Software applications and libraries that use Open Babel (Continued)
    Name | Description | Reference | Web page
    Toxtree | Toxic hazard estimation using decision trees | Ideaconsult Ltd. | http://toxtree.sf.net/
    V_Sim | Visualize atomic structures such as crystals and grain boundaries | Damien Caliste | http://inac.cea.fr/L_Sim/V_Sim/index.en.html
    WebBabel | Web application for file format conversion | T.J. O’Donnell | http://webbabel.sf.net/
    XDrawChem | 2D molecular editor | Bryan Herger | http://xdrawchem.sf.net/
    XtalOpt | Extension to Avogadro for crystal-structure prediction | [89] | http://xtalopt.openmolecules.net/
    YASARA | GUI for molecular graphics, modeling and simulation | Elmar Krieger | http://www.yasara.org/
    ZODIAC | GUI for molecular modelling and docking | [90] | http://www.zeden.org/

    or “/m1”), they used the InChIs as part of a workflow to identify kryptoracemates (a class of racemic crystals where the enantiomers are not related by space-group symmetry) in the database.
    To implement new methods, or access additional molecular information, it is necessary to use the Open Babel library directly either from C++ or using one of the supported language bindings. Some examples of published studies that have done this include the following:
    • Dehmer et al. implemented molecular complexity measures based on information theory [63].
    • Langham and Jain developed a model for chemical mutagenicity based on atom pair features [64].
    • Fontaine et al. implemented a method, anchor-GRIND, that uses an anchor point of a molecular scaffold to compare molecular interaction fields when different substituents are present [65].
    • Konyk et al. have developed a plugin for Open Babel that adds support for the Web Ontology Language (OWL) to allow automated reasoning about chemical structures [66].
    • Kogej et al. (AstraZeneca) implemented a 3-point pharmacophore fingerprint called TRUST [67].

    Table 2 Web applications and databases that use Open Babel
    Name | Description | Reference | Web page
    ChemDB | Database of small molecules | [91] | http://cdb.ics.uci.edu/
    Cheméo | Chemical structure and property search engine | Céondo Ltd | http://www.chemeo.com/
    ChemMine Tools | Web application for analysing and clustering small molecules | [92] | http://chemmine.ucr.edu/
    eMolecules | Chemical vendor search engine | eMolecules.com | http://emolecules.com/
    FragmentStore | Database for comparison of fragments found in metabolites, drugs and toxic compounds | [93] | http://bioinf-applied.charite.de/fragment_store/
    Frog2 | FRee Online druG 3D conformation generation | [94] | http://bioserv.rpbs.univ-paris-diderot.fr/cgi-bin/Frog2
    hBar Lab | Web application providing on-demand access to computer-aided chemistry | hBar Solutions ApS | https://www.hbar-lab.com/
    IUPHAR-DB | Database of human drug targets and their ligands | [95] | http://www.iuphar-db.org/
    OpenCDLig | Web application for sharing resources about cyclodextrin/ligand complexes | [96] | https://kdd.di.unito.it/casmedchem/
    PSMDB | Protein - Small-Molecule Database | [97] | http://compbio.cs.toronto.edu/psmdb/
    SambVca | Web application for calculation of buried volume of organometallic ligands | [98] | https://www.molnac.unisa.it/OMtools/sambvca.php
    ScafBank | Database of molecular scaffolds | [99] | http://202.127.30.184:8080/scafbank.html
    SMARTCyp | Web application for prediction of sites of cytochrome P450 mediated metabolism | [100] | http://www.farma.ku.dk/smartcyp/
    sMol Explorer | Web application for exploring small-molecule datasets | [101] | http://www3a.biotec.or.th/isl/index.php/smol-explorer
    SuperImposé | Web application for structural similarity between ligands, binding sites or proteins | [102] | http://farnsworth.charite.de/superimpose-web/
    SuperToxic | Database of toxic compounds | [103] | http://bioinformatics.charite.de/supertoxic/
    SuperSite | Detailed information on, and comparisons of, protein-ligand binding sites | [104] | http://bioinf-tomcat.charite.de/supersite/
    SuperSweet | Database of natural and artificial sweeteners | [105] | http://bioinf-applied.charite.de/sweet/
    STITCH2 | Chemical-protein interactions | [106] | http://stitch.embl.de/
    VCCLAB | Virtual Computational Chemistry Laboratory | [107] | http://www.vcclab.org/
    wwLigCSRre | Web application that performs ligand-based screening using 3D similarity | [108] | http://bioserv.rpbs.univ-paris-diderot.fr/Help/wwLigCSRre.html
  • 36. • Many other examples exist [68-71].
    The vital role that a cheminformatics toolkit plays in the development of scientific resources is shown by Tables 1 and 2. Table 1 lists examples of stand-alone applications or programming libraries that rely on Open Babel, either calling the library directly or via one of the command-line executables. Table 2 contains examples of web applications and databases that either use Open Babel on the server or where Open Babel was used in the preparation of the data.

    Conclusions
    In November 2011, Open Babel will mark 10 years of existence as an independent project, and for the first time, we have discussed its development and features. As shown by more than 400 citations, it has become an essential tool for handling the myriad of molecular file formats encountered in diverse branches of chemistry. While more work remains to be done, through validation processes such as those described above and the recent introduction of a nightly build and testing framework, we aim to improve the quality and robustness of the toolkit with each new release.
    Looking to the future, one of the goals of the project is to extend support to molecules that currently are not handled very well by existing cheminformatics toolkits. Typically toolkits focus on the types of molecules of principal importance to the pharmaceutical industry, namely stable organic molecules composed wholly of 2-center 2-electron covalent bonds. Molecules outside this set - such as radicals, organometallic and inorganic molecules, molecules with coordinate bonds or 3-center 2-electron bonds - are poorly supported in general. Future releases of Open Babel will provide substantially improved handling of such species. We also seek to improve speed and coverage of important methods such as structure generation, kekulization and canonicalization.
    Open Babel is freely available from http://openbabel.org, and new community members are very welcome (users, developers, bug reporters, feature requesters). For information on how to use Open Babel, please see the documentation at http://openbabel.org/docs and the API documentation at http://openbabel.org/api.

    Availability and Requirements
    Project Name: Open Babel
    Project home page: http://openbabel.org
    Operating system(s): Cross-platform
    Programming language: C++, bindings to Python, Perl, Ruby, Java, C#
    Other requirements (if compiling): CMake 2.4+
    License: GNU GPL v2
    Any restrictions to use by non-academics: None

    Acknowledgements and Funding
    We would like to thank all users and contributors to the Open Babel project over its history, including OpenEye Scientific Software Inc. for their initial OELib code. We also thank the Blue Obelisk Movement for ideas, comments on this manuscript, and support. We thank SourceForge for providing resources for issue tracking and managing releases, and Kitware for additional dashboard resources. NMOB is supported by a Health Research Board Career Development Fellowship (PD/2009/13).

    Author details
    1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, Co. Cork, Ireland. 2 Department of Chemistry, Technische Universität München, Garching D-85747, Germany. 3 eMolecules, Inc., 420 Stevens Ave #120, Solana Beach, CA 92075, USA. 4 Open Babel development team. 5 University of Pittsburgh, Department of Chemistry, 219 Parkman Avenue, Pittsburgh, PA 15217, USA.

    Authors’ contributions
    GRH is the lead developer of the Open Babel project. CAJ, CM, MB, NMOB, and TV are developers of Open Babel. All authors read and approved the final manuscript.

    Competing interests
    The authors declare that they have no competing interests.

    Received: 27 June 2011 Accepted: 7 October 2011 Published: 7 October 2011

    References
    1. Weininger D: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988, 28:31-36.
    2. Murray-Rust P, Rzepa H: Chemical markup, XML, and the Worldwide Web. 1. Basic principles. J Chem Inf Comput Sci 1999, 39:928-942.
    3. Murray-Rust P, Rzepa HS: Chemical Markup, XML and the World-Wide Web. 2. Information Objects and the CMLDOM. J Chem Inf Model 2001, 41:1113-1123.
    4. Murray-Rust P, Rzepa H, Wright M: Development of chemical markup language (CML) as a system for handling complex chemical content. New J Chem 2001, 25:618-634.
    5. Murray-Rust P, Rzepa H: Chemical Markup, XML, and the World Wide Web. 4. CML Schema. J Chem Inf Comput Sci 2003, 43:757-772.
    6. Holliday GL, Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. J Chem Inf Model 2006, 46:145-157.
    7. Daylight Theory: SMARTS [http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html].
    8. Fogel K: Producing Open Source Software: How to Run a Successful Free Software Project. O’Reilly Media, Inc. Sebastopol, CA; 2005.
    9. Citations were generated by Google Scholar: [http://scholar.google.com/scholar?as_q=openbabel&num=10&as_occt=any&as_publication=&as_ylo=2001].
    10. A selection of such projects is included below. The full list is available at: [http://openbabel.org/wiki/Related_Projects].
    11. Open Babel: [http://openbabel.org/].
    12. Open Babel Report Format: [http://openbabel.org/docs/2.3.0/FileFormats/Open_Babel_report_format.html].
    13. Open Babel Fingerprint Format: [http://openbabel.org/docs/2.3.0/FileFormats/Fingerprint_format.html].
    14. Open Babel Fastsearch Format: [http://openbabel.org/docs/2.3.0/FileFormats/Fastsearch_format.html].
    15. MolPrint2D Format: [http://openbabel.org/docs/2.3.0/FileFormats/MolPrint2D_format.html].
    16. Bender A, Mussa HY, Glen RC, Reiling S: Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naïve Bayesian Classifier. J Chem Inf Model 2004, 44:170-178.
    17. MNA Format: [http://openbabel.org/docs/2.3.0/FileFormats/Multilevel_Neighborhoods_of_Atoms_(MNA).html].
  • 37. References (continued)
    18. Filimonov D, Poroikov V, Borodina Y, Gloriozova T: Chemical Similarity Assessment through Multilevel Neighborhoods of Atoms: Definition and Comparison with the Other Descriptors. J Chem Inf Model 1999, 39:666-670.
    19. PDB Format v3.2: [http://www.wwpdb.org/documentation/format32/v3.2.html].
    20. PDB: Cruft to Content: [http://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html].
    21. Morgan HL: The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J Chem Docum 1965, 5:107-113.
    22. Nauty: [http://cs.anu.edu.au/~bdm/nauty/].
    23. McKay BD: Practical graph isomorphism. Congressus Numerantium 1981, 30:45-87.
    24. Gakh A, Burnett M: Modular Chemical Descriptor Language (MCDL): Composition, connectivity, and supplementary modules. J Chem Inf Comput Sci 2001, 41:1494-1499.
    25. Trepalin SV, Yarkov AV, Pletnev IV, Gakh AA: A Java Chemical Structure Editor Supporting the Modular Chemical Descriptor Language (MCDL). Molecules 2006, 11:219-231.
    26. Gakh AA, Burnett MN, Trepalin SV, Yarkov AV: Modular Chemical Descriptor Language (MCDL): Stereochemical modules. J Cheminf 2011, 3:5.
    27. Halgren T: Merck molecular force field .1. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem 1996, 17:490-519.
    28. Halgren T: Merck molecular force field .2. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J Comput Chem 1996, 17:520-552.
    29. Halgren T: Merck molecular force field .3. Molecular geometries and vibrational frequencies for MMFF94. J Comput Chem 1996, 17:553-586.
    30. Halgren T, Nachbar R: Merck molecular force field .4. Conformational energies and geometries for MMFF94. J Comput Chem 1996, 17:587-615.
    31. Halgren T: Merck molecular force field .5. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J Comput Chem 1996, 17:616-641.
    32. Andronico A, Randall A, Benz RW, Baldi P: Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. J Chem Inf Model 2011, 51:760-776.
    33. Rappe A, Casewit C, Colwell K, Goddard W III, Skiff WM: UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 1992, 114:10024-10035.
    34. Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA: Development and testing of a general amber force field. J Comput Chem 2004, 25:1157-1174.
    35. Wang J, Wang W, Kollman PA, Case DA: Automatic atom type and bond type perception in molecular mechanical calculations. J Molec Graph Model 2006, 25:247-260.
    36. O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR: Confab - Systematic generation of diverse low-energy conformers. J Cheminf 2011, 3:8.
    37. CMake: [http://www.cmake.org/].
    38. Martin K, Hoffman B: Mastering CMake: A Cross-Platform Build System. Kitware, Inc., Clifton Park, NY; 5 2010.
    39. CDash Dashboard for Open Babel: [http://my.cdash.org/index.php?project=Open+Babel].
    40. O’Boyle N, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
    41. Open Babel Bug Tracker: [https://sourceforge.net/tracker/?limit=25&func=&group_id=40728&atid=428740&status=2].
    42. Doxygen: [http://www.doxygen.org/].
    43. Open Babel API: [http://openbabel.org/api].
    44. Myers J, Allison T, Bittner S, Didier B, Frenklach M, Green W, Ho Y, Hewson J, Koegler W, Lansing C, et al: A collaborative informatics infrastructure for multi-scale science. Cluster Computing 2005, 8:243-253.
    45. Lind P, Alm M: A Database-Centric Virtual Chemistry System. J Chem Inf Model 2006, 46:1034-1039.
    46. Amini A, Shrimpton PJ, Muggleton SH, Sternberg MJE: A general approach for developing system-specific functions to score protein-ligand docked complexes using support vector inductive logic programming. Proteins: Struct, Funct, Bioinf 2007, 69:823-831.
    47. Arbor S, Marshall GR: A virtual library of constrained cyclic tetrapeptides that mimics all four side-chain orientations for over half the reverse turns in the protein data bank. J Comput-Aided Mol Des 2008, 23:87-95.
    48. Huang Z, Wong CF: A Mining Minima Approach to Exploring the Docking Pathways of p-Nitrocatechol Sulfate to YopH. Biophys J 2007, 93:4141-4150.
    49. Hill AD, Reilly PJ: A Gibbs free energy correlation for automated docking of carbohydrates. J Comput Chem 2008, 29:1131-1141.
    50. Armen RS, Chen J, Brooks CL III: An Evaluation of Explicit Receptor Flexibility in Molecular Docking Using Molecular Dynamics and Torsion Angle Molecular Dynamics. J Chem Theory Comp 2009, 5:2909-2923.
    51. Liu L, Ma H, Yang N, Tang Y, Guo J, Tao W, Jaa Duan: A Series of Natural Flavonoids as Thrombin Inhibitors: Structure-activity relationships. Thromb Res 2010, 126:e365-e378.
    52. Wallach I, Jaitly N, Lilien R: A Structure-Based Approach for Mapping Adverse Drug Reactions to the Perturbation of Underlying Biological Pathways. PLoS One 2010, 5:e12063.
    53. Paila YD, Tiwari S, Sengupta D, Chattopadhyay A: Molecular modeling of the human serotonin1A receptor: role of membrane cholesterol in ligand binding of the receptor. Molecular BioSystems 2011, 7:224-234.
    54. Melville JL, Hirst JD: TMACC: Interpretable Correlation Descriptors for Quantitative Structure-Activity Relationships. J Chem Inf Model 2007, 47:626-634.
    55. Pencheva T, Lagorce D, Pajeva I, Villoutreix BO, Miteva MA: AMMOS: Automated Molecular Mechanics Optimization tool for in silico Screening. BMC Bioinformatics 2008, 9:438.
    56. Schietgat L, Ramon J, Bruynooghe M: An Efficiently Computable Graph-Based Metric for the Classification of Small Molecules. Proceedings of the 11th International Conference on Discovery Science. Springer-Verlag Berlin, Heidelberg; 2008, 197-209.
    57. Krier M, Hutter MC: Bioisosteric Similarity of Molecules Based on Structural Alignment and Observed Chemical Replacements in Drugs. J Chem Inf Model 2009, 49:1280-1297.
    58. Wang X, Huan J, Smalter A, Lushington GH: Application of kernel functions for accurate similarity search in large chemical databases. BMC Bioinformatics 2010, 11:S8.
    59. Cheng T, Li Q, Wang Y, Bryant SH: Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection. J Chem Inf Model 2011, 51:229-236.
    60. Mihaleva VV, Verhoeven HA, de Vos RCH, Hall RD, van Ham RCHJ: Automated procedure for candidate compound selection in GC-MS metabolomics based on prediction of Kovats retention index. Bioinformatics 2009, 25:787-794.
    61. Bas DC, Rogers DM, Jensen JH: Very fast prediction and rationalization of pKa values for protein-ligand complexes. Proteins: Struct, Funct, Bioinf 2008, 73:765-783.
    62. Fabian L, Brock CP: A list of organic kryptoracemates. Acta Cryst 2010, B66:94-103.
    63. Dehmer M, Barbarini N, Varmuza K, Graber A: A Large Scale Analysis of Information-Theoretic Network Complexity Measures Using Chemical Structures. PLoS One 2009, 4:e8057.
    64. Langham JJ, Jain AN: Accurate and Interpretable Computational Modeling of Chemical Mutagenicity. J Chem Inf Model 2008, 48:1833-1839.
    65. Fontaine F, Pastor M, Zamora I: Anchor-GRIND: Filling the gap between standard 3D QSAR and the GRid-INdependent Descriptors. J Med Chem 2005, 48(7):2687-94.
    66. Konyk M, De Leon A, Dumontier M: Chemical knowledge for the semantic web. Data Integration in the Life Sciences. Springer-Verlag Berlin, Heidelberg; 2008, 169-176.
    67. Kogej T, Engkvist O, Blomberg N, Muresan S: Multifingerprint Based Similarity Searches for Targeted Class Compound Selection. J Chem Inf Model 2006, 46:1201-1213.
    68. Reynès C, Host H, Camproux A-C, Laconde G, Leroux F, Mazars A, Deprez B, Fahraeus R, Villoutreix BO, Sperandio O: Designing Focused Chemical Libraries Enriched in Protein-Protein Interaction Inhibitors using Machine-Learning Methods. PLoS Computational Biology 2010, 6:e1000695.
    69. Lagorce D, Pencheva T, Villoutreix BO, Miteva MA: DG-AMMOS: A New tool to generate 3D conformation of small molecules using Distance Geometry and Automated Molecular Mechanics Optimization for in silico Screening. BMC Chemical Biology 2009, 9:6.
  • 38. References (continued)
    70. Gómez MJ, Pazos F, Guijarro FJ, de Lorenzo V, Valencia A: The environmental fate of organic pollutants through the global microbial metabolism. Molecular Systems Biology 2007, 3:114.
    71. Kazius J, Nijssen S, Kok J, Bäck T, IJzerman AP: Substructure Mining Using Elaborate Chemical Representation. J Chem Inf Model 2006, 46:597-605.
    72. O’Boyle NM, Tenderholt AL, Langner KM: cclib: A library for package-independent computational chemistry algorithms. J Comput Chem 2008, 29:839-845.
    73. Brüstle M: Chemtool - Moleküle zeichnen mit dem Pinguin. Nachrichten aus der Chemie 2001, 49:1310-1313.
    74. Buehler M, Dodson J, van Duin A: The Computational Materials Design Facility (CMDF): A powerful framework for multi-paradigm multi-scale simulations. Materials Research Society symposium proceedings 2006, 894:LL3.8.
    75. Bullock CW, Jacob RB, McDougal OM, Hampikian G, Andersen T: Dockomatic - automated ligand creation and docking. BMC Research Notes 2010, 3:289.
    76. Jiang X, Kumar K, Hu X, Wallqvist A, Reifman J: DOVIS 2.0: an efficient and easy to use parallel virtual screening tool based on AutoDock 4.0. Chem Cent J 2008, 2:18.
    77. Lagorce D, Sperandio O, Galons H, Miteva MA, Villoutreix BO: FAF-Drugs2: Free ADME/tox filtering tool to assist drug discovery and chemical biology projects. BMC Bioinformatics 2008, 9:396.
    78. Maunz A, Helma C, Kramer S: Efficient mining for structurally diverse subgraph patterns in large molecular databases. Machine Learning 2010, 83:193-218.
    79. Maunz A, Helma C, Kramer S: Large-scale graph mining using backbone refinement classes. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009). ACM Paris; 2009, 617-626.
    80. Helma C: Lazy structure-activity relationships (lazar) for the prediction of rodent carcinogenicity and Salmonella mutagenicity. Mol Diversity 2006, 10:147-158.
    81. Meineke MA, Vardeman CF, Lin T, Fennell CJ, Gezelter JD: OOPSE: an object-oriented parallel simulation engine for molecular dynamics. J Comput Chem 2005, 26:252-271.
    82. Tosco P, Balle T: Brute-force pharmacophore assessment and scoring with Open3DQSAR. J Cheminf 2011, 3(Suppl 1):P39.
    83. Tosco P, Balle T: Open3DQSAR: a new open-source software aimed at high-throughput chemometric analysis of molecular interaction fields. J Mol Model 2011, 17:201-208.
    84. Filippov IV, Nicklaus MC: Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. J Chem Inf Model 2009, 49:740-743.
    85. Koes DR, Camacho CJ: Pharmer: Efficient and Exact Pharmacophore Search. J Chem Inf Model 2011, 51(6):1307-14.
    86. Jacob CR, Beyhan SM, Bulo RE, Gomes ASP, Götz AW, Kiewisch K, Sikkema J, Visscher L: PyADF - A scripting framework for multiscale quantum chemistry. J Comput Chem 2011, 32:2328-2338.
    87. Green WH, Allen JW, Ashcraft RW, Beran GJ, Class CA, Gao C, Goldsmith CF, Harper MR, Jalan A, Magoon GR, Matheu DM, Merchant SS, Mo JD, Petway S, Raman S, Sharma S, Song J, Van Geem KM, Wen J, West RH, Wong A, Wong H-W, Yelvington PE, Yu J: RMG - Reaction Mechanism Generator v3.3. 2011 [http://rmg.sourceforge.net/].
    88. Karwath A, De Raedt L: SMIREP: Predicting Chemical Activity from SMILES. J Chem Inf Model 2006, 46:2432-2444.
    89. Lonie DC, Zurek E: XTALOPT: An open-source evolutionary algorithm for crystal structure prediction. Comput Phys Commun 2011, 182:372-387.
    90. Zonta N, Grimstead IJ, Avis NJ, Brancale A: Accessible haptic technology for drug design applications. J Mol Model 2008, 15:193-196.
    91. Chen JH, Linstead E, Swamidass SJ, Wang D, Baldi P: ChemDB update full-text search and virtual chemical space. Bioinformatics 2007, 23:2348-2351.
    92. Backman TWH, Cao Y, Girke T: ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic Acids Res 2011, 39(Web Server issue):W486-91.
    93. Ahmed J, Worth CL, Thaben P, Matzig C, Blasse C, Dunkel M, Preissner R: FragmentStore - a comprehensive database of fragments linking metabolites, toxic molecules and drugs. Nucleic Acids Res 2010, 39:D1049-D1054.
    94. Miteva MA, Guyon F, Tuffery P: Frog2: Efficient 3D conformation ensemble generator for small compounds. Nucleic Acids Res 2010, 38:W622-W627.
    95. Sharman JL, Mpamhanga CP, Spedding M, Germain P, Staels B, Dacquet C, Laudet V, Harmar AJ, NC-IUPHAR: IUPHAR-DB: new receptors and tools for easy searching and visualization of pharmacological data. Nucleic Acids Res 2010, 39:D534-D538.
    96. Esposito R, Ermondi G, Caron G: OpenCDLig: a free web application for sharing resources about cyclodextrin/ligand complexes. J Comput-Aided Mol Des 2009, 23:669-675.
    97. Wallach I, Lilien R: The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 2009, 25:615-620.
    98. Poater A, Cosenza B, Correa A, Giudice S, Ragone F, Scarano V, Cavallo L: SambVca: A Web Application for the Calculation of the Buried Volume of N-Heterocyclic Carbene Ligands. Eur J Inorg Chem 2009, 2009:1759-1766.
    99. Yan B-b, Xue M-z, Xiong B, Liu K, Hu D-y, Shen J-k: ScafBank: a public comprehensive Scaffold database to support molecular hopping. Acta Pharmacologica Sinica 2009, 30:251-258.
    100. Rydberg P, Gloriam DE, Olsen L: The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics 2010, 26:2988-2989.
    101. Ingsriswang S, Pacharawongsakda E: sMOL Explorer: an open source, web-enabled database and exploration tool for Small MOLecules datasets. Bioinformatics 2007, 23:2498-2500.
    102. Bauer RA, Bourne PE, Formella A, Frommel C, Gille C, Goede A, Guerler A, Hoppe A, Knapp EW, Poschel T, et al: Superimpose: a 3D structural superposition server. Nucleic Acids Res 2008, 36:W47-W54.
    103. Schmidt U, Struck S, Gruening B, Hossbach J, Jaeger IS, Parol R, Lindequist U, Teuscher E, Preissner R: SuperToxic: a comprehensive database of toxic compounds. Nucleic Acids Res 2009, 37:D295-D299.
    104. Bauer RA, Gunther S, Jansen D, Heeger C, Thaben PF, Preissner R: SuperSite: dictionary of metabolite and drug binding sites in proteins. Nucleic Acids Res 2009, 37:D195-D200.
    105. Ahmed J, Preissner S, Dunkel M, Worth CL, Eckert A, Preissner R: SuperSweet - a resource on natural and artificial sweetening agents. Nucleic Acids Res 2010, 39:D377-D382.
    106. Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen LJ, Beyer A, Bork P: STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res 2009, 38:D552-D556.
    107. Tetko IV, Gasteiger J, Todeschini R, Mauri A, Livingstone D, Ertl P, Palyulin VA, Radchenko EV, Zefirov NS, Makarenko AS, et al: Virtual Computational Chemistry Laboratory - Design and Description. J Comput-Aided Mol Des 2005, 19:453-463.
    108. Sperandio O, Petitjean M, Tuffery P: wwLigCSRre: a 3D ligand-based server for hit identification and optimization. Nucleic Acids Res 2009, 37:W504-W509.

    doi:10.1186/1758-2946-3-33
    Cite this article as: O’Boyle et al.: Open Babel: An open chemical toolbox. Journal of Cheminformatics 2011 3:33.
  • 39. Part II Enzyme reaction mechanisms
  • 41. Vol. 21 no. 23 2005, pages 4315–4316 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/bti693
    Databases and ontologies
    MACiE: a database of enzyme reaction mechanisms
    Gemma L. Holliday1, Gail J. Bartlett2,†, Daniel E. Almonacid1, Noel M. O’Boyle1, Peter Murray-Rust1, Janet M. Thornton2 and John B. O. Mitchell1,*
    1 Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK and 2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
    Received on July 21, 2005; revised on September 22, 2005; accepted on September 23, 2005
    Advance Access publication September 27, 2005

    ABSTRACT
    Summary: MACiE (mechanism, annotation and classification in enzymes) is a publicly available web-based database, held in CMLReact (an XML application), that aims to help our understanding of the evolution of enzyme catalytic mechanisms and also to create a classification system which reflects the actual chemical mechanism (catalytic steps) of an enzyme reaction, not only the overall reaction.
    Availability: http://www-mitchell.ch.cam.ac.uk/macie/
    Contact: jbom1@cam.ac.uk

    A great deal of knowledge about enzymes, including structures, gene sequences, mechanisms, metabolic pathways and kinetic data, now exists. However, it is spread between many different databases and throughout the literature. Here we announce the completion of the initial version of MACiE, a unique database of the chemical mechanisms of enzymatic reactions.
    Web resources such as BRENDA (Schomburg et al., 2004), KEGG (Kanehisa et al., 2004) and the International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature website (IUBMB, 2005, http://www.chem.qmul.ac.uk/iubmb/enzyme/) contain descriptions of the overall reactions performed by enzymes, accompanied in some cases by a textual or graphical description of the mechanism. MACiE is unique in combining detailed stepwise mechanistic information (including 2D animations), a wide coverage of both chemical space and the protein structure universe, and the chemical intelligence of CMLReact (Holliday,C.L., Murray-Rust,P., and Rzepa,H.S., 2005, manuscript submitted to J. Chem. Inf. Modeling). MACiE usefully complements both the mechanistic detail of the Structure–Function Linkage Database (SFLD) for a small number of enzyme superfamilies (Pegg et al., 2005) and the wider coverage with less chemical detail provided by EzCatDB (Nagano, 2005), which also contains a limited number of 3D animations.

    DESIGN
    The MACiE dataset evolved from that published in the Catalytic Site Atlas (CSA) (Bartlett et al., 2002; Porter et al., 2004), and each entry is selected so that it fulfils the following criteria:
    (1) There is a 3D crystal structure of the enzyme deposited in the Protein Data Bank (PDB) (Berman et al., 2000).
    (2) There is a relatively well-understood mechanism available. Taken from the literature, these cover a variety of methodologies, including chemical and biochemical studies, quantum mechanical calculations and structural biology reports.
    (3) The enzyme is unique at the H level of the CATH classification (a hierarchical classification system of protein domain structures; Orengo et al., 1997), unless there is a homologue with a significantly different chemical mechanism.
    (4) Where there are a number of possible PDB codes available, the entry should be, if possible, a wild-type enzyme.
    All MACiE enzymes are also contained in the Enzyme Commission (EC) classification system (IUBMB, 2005, http://www.chem.qmul.ac.uk/iubmb/enzyme/), that is, they all have four-number codes describing their overall reaction. The first level (class) describes the basic reaction type. The second and third levels (subclass and sub-subclass, respectively) describe the reaction in further detail, and the final level (serial number) describes substrate specificity. For example, the β-lactamases (Fig. 1) are assigned the EC number 3.5.2.6, i.e. a hydrolase (3) acting on a C–N bond (5) in a cyclic amide (2) with a β-lactam as the substrate (6).
    In MACiE, the data centre on the catalytic steps involved in the chemical mechanism as well as the overall reaction. Each entry includes the following information:
    • Enzyme name and EC number
    • PDB code and CATH codes of all domains in the enzyme
    • Diagram and annotation of the overall reaction
    • Primary literature references

    * To whom correspondence should be addressed.
    † Present Address: Bioinformatics Support Service (Biochemistry Building), Centre for Bioinformatics, Division of Molecular Biosciences, Faculty of Life Sciences, Imperial College London, London, SW7 2AZ, UK

    © The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
    The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
    Bioinformatics. 2005, 21, 4315-4316.
  • 42. Fig. 1. The overall reaction for a β-lactamase.

- Diagram and annotation of all reaction steps, including:
  - The Ingold mechanism (Ingold, 1969)
  - Diagram and function of catalytic amino acid residues
  - Information on the reactive centres and bond changes
- Comments on the reaction (where applicable).

CONTENT
The criteria defined in the Design section initially produced a dataset of 100 entries. A single EC number may cover several MACiE entries when different mechanisms bring about the same overall chemical transformation, as with the two types of 3-dehydroquinate dehydratase, and thus 100 MACiE entries span only 96 EC numbers.

The 100 enzymes in Version 1 of MACiE incorporate domains from 140 CATH homologous superfamilies. MACiE currently covers 56 of the 174 EC sub-subclasses present in the PDB; thus, we feel that we have a representative coverage of EC reaction space (comparative EC wheels are available at URL http://www-mitchell.ch.cam.ac.uk/macie/ECCoverage/). We anticipate that all 158 sub-subclasses for which both structures and reliable mechanisms are available will be represented in the forthcoming MACiE Version 2.

SOFTWARE
The data are initially entered in MDL’s ISIS/Base, a database package for chemical reactions, validated by at least two people, and then converted into CMLReact using the Jumbo Toolkit (Wakelin et al., 2005) to create an information-rich and semantically rich database. At this stage we add extra fields of information to the CMLReact version of MACiE that are unavailable in the ISIS version, including the CATH code. Jumbo is a set of Java-based software tools that converts the MDL file format produced from ISIS/Base into CMLReact. The MacieConverter section of Jumbo performs the following functions:

- Integration of the files in the ISIS/Base version of MACiE
- Identification of reactant, product and spectator molecules
- Splitting of groups of molecules
- Automatic mapping of atoms within the reaction
- Checking for mass and charge conservation throughout the reaction (stoichiometry)
- Integration and checking of MACiE Dictionary entries.

Once the conversion process has been completed, a further tool in the Jumbo Toolkit, called CMLSnap (Holliday et al., 2004), can be used to create an animation of the reaction. This animation includes all of the atoms and bonds involved as well as the electron movements, which are calculated automatically. It is expected that CML will become our primary method of data entry and storage.

CURATION
The annotation process involves input and validation steps. Terms have been rigorously defined either from the IUPAC Gold Book (McNaught et al., 1997), such as chemical terms like hydrolysis, or from primary literature, such as mechanism, which is defined using Ingold’s terminology (Ingold, 1969), originally put forward in the 1930s. All of the technical and scientific terms used in MACiE are contained in the MACiE dictionary, which is available at the URL http://www-mitchell.ch.cam.ac.uk/macie/glossary.html and is also available as a raw XML file.

The entries online are accessed via an HTML look-up table and include all of the information available in the database. The original ISIS/Base format file and the raw CML files can be supplied.

FUTURE WORK
Future work includes expanding the dataset to include a representative set of EC numbers (at the sub-subclass level), creating a search interface for MACiE and developing authoring tools for MACiE in CML. Ongoing research focuses on the evolution of enzyme catalysis and the classification of enzyme reaction mechanisms.

ACKNOWLEDGEMENTS
G.J.B. would like to thank Dr Jonathan Goodman for his invaluable help with organic chemistry queries. We would also like to thank the EPSRC (G.L.H. and J.B.O.M.), the BBSRC (G.J.B. and J.M.T.: CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M.: grant BB/C51320X/1), the Chilean Government’s Ministerio de Planificación y Cooperación and Cambridge Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science Informatics.

Conflict of Interest: none declared.

REFERENCES
Bartlett,G.J. et al. (2002) Analysis of catalytic residues in enzyme active sites. J. Mol. Biol., 324, 105–121.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Holliday,G.L. et al. (2004) CMLSnap: animated reaction mechanisms. Internet J. Chem., 7, Article 4.
Ingold,C.K. (1969) Structure and Mechanism in Organic Chemistry. 2nd edn, Cornell University Press, Ithaca, NY, Chapters 5–15.
Kanehisa,M. et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
McNaught,A.D. and Wilkinson,A. (1997) International Union of Pure and Applied Chemistry Compendium of Chemical Terminology (“The Gold Book”). 2nd edn, ISBN 0-8-654-26848.
Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism Database. Nucleic Acids Res., 33, D407–D412.
Orengo,C.A. et al. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Pegg,S.C-H. et al. (2005) Representing structure-function relationships in mechanistically diverse enzyme superfamilies. Pac. Symp. Biocomput., 358–369.
Porter,C.T. et al. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.
Schomburg,I. et al. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431–D433.
Wakelin,J. et al. (2005) CML tools and information flow in atomic scale simulations. Mol. Simul., 31, 315–322.
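One of the MacieConverter functions listed in the SOFTWARE section is checking for mass and charge conservation throughout the reaction. The sketch below illustrates the idea only. It is not the Jumbo implementation (which is Java-based and works on CMLReact); the `Molecule` record, the `totals` and `is_balanced` helpers and the example reaction (hydrolysis of acetamide, chosen for illustration and not taken from MACiE) are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical minimal record: element counts plus formal charge."""
    name: str
    atoms: dict
    charge: int = 0


def totals(molecules):
    """Sum element counts and formal charges over one side of a reaction."""
    atom_total = Counter()
    charge_total = 0
    for mol in molecules:
        atom_total.update(mol.atoms)
        charge_total += mol.charge
    return atom_total, charge_total


def is_balanced(reactants, products):
    """True when both sides carry the same atoms and the same net charge."""
    return totals(reactants) == totals(products)


# Illustrative amide hydrolysis: acetamide + water -> acetic acid + ammonia
acetamide = Molecule("acetamide", {"C": 2, "H": 5, "N": 1, "O": 1})
water = Molecule("water", {"H": 2, "O": 1})
acetic_acid = Molecule("acetic acid", {"C": 2, "H": 4, "O": 2})
ammonia = Molecule("ammonia", {"N": 1, "H": 3})

print(is_balanced([acetamide, water], [acetic_acid, ammonia]))
print(is_balanced([acetamide], [acetic_acid, ammonia]))
```

A check of this kind is a cheap way to catch a missing spectator molecule or a mistyped charge before an entry is stored.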
  • 43. Published online 1 November 2006
Nucleic Acids Research, 2007, Vol. 35, Database issue D515–D520
doi:10.1093/nar/gkl774

MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms

Gemma L. Holliday*, Daniel E. Almonacid1, Gail J. Bartlett, Noel M. O’Boyle1, James W. Torrance, Peter Murray-Rust1, John B. O. Mitchell1 and Janet M. Thornton
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 1 Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
Received August 4, 2006; Revised September 18, 2006; Accepted October 1, 2006

ABSTRACT
MACiE (Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and is publicly available as a web-based data resource. This paper presents the first release of a web-based search tool to explore enzyme reaction mechanisms in MACiE. We also present Version 2 of MACiE, which doubles the dataset available (from Version 1). MACiE can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/

INTRODUCTION
Enzymes are proteins that catalyse the repertoire of chemical reactions found in nature, and as such are vitally important molecules. What is so fascinating about these proteins is that they have a wonderful diversity and can carry out highly complex chemical conversions under physiological conditions and retain their stereospecificity and regiospecificity, unlike many organic chemical reactions. They range in size and can have molecular weights of several thousand to several million Daltons, and still they can catalyse reactions on molecules as small as carbon dioxide or nitrogen, or as large as a complete chromosome.

Although enzymes are large molecules, the actual catalysis only takes place in a small cavity, the active site. It is here that a small number of amino acid residues contribute to catalytic function, and where the substrates bind. With the advent of structure determination methods for proteins and by using clever chemical/biochemical experimental design, scientists have been able to propose catalytic mechanisms for many enzymes. Although a great deal of knowledge exists for enzymes, including their structures, gene sequences, mechanisms, metabolic pathways and kinetic data, it tends to be spread between many different databases and throughout the literature. Most web resources relating to enzymes [such as BRENDA (1), KEGG (2), the IUBMB Enzyme Nomenclature website (http://www.chem.qmul.ac.uk/iubmb/enzyme/) (3) and IntEnz (4)] focus on the overall reaction, accompanied in some cases by a textual or graphical description of the mechanism. However, this does not allow for detailed in silico searching of the chemical steps which take place in the reaction. MACiE (5) combines detailed stepwise mechanistic information [including 2-D animations (6)], a wide coverage of both chemical space and the protein structure universe, and the chemical intelligence of the Chemical Markup Language for Reactions (CMLReact) (7). This usefully complements both the mechanistic detail of the Structure–Function Linkage Database (SFLD) for a small number of rather ‘promiscuous’ enzyme superfamilies (8) and the wider coverage with less chemical detail provided by EzCatDB (9), which also contains a limited number of 3D animations. Entries in MACiE are linked, where appropriate, to all of these related data resources.

DATASET AND CONTENT
The dataset for MACiE version 2 was devised to increase the enzyme reaction space coverage of MACiE while trying to keep structural homology to a minimum. Each entry added in the new version was selected so that it fulfils the following criteria:

(i) The EC sub-subclass was not previously in MACiE.
(ii) There is a three-dimensional crystal structure of the enzyme deposited in the Protein Data Bank (wwPDB) (10).
(iii) There is a mechanism available from the primary literature which explains most of the observed experimental results.

*To whom correspondence should be addressed. Tel: +44 1223 492535; Fax: +44 1223 494486; Email: gemma@ebi.ac.uk
Present address: Gail J. Bartlett, Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nucleic Acid Res. 2007, 35, D515-D520.
  • 44. Figure 1. EC wheels showing the EC coverage of MACiE Version 2 (left), the complete EC space (centre) and the coverage of EC space in the PDB by unique EC serial numbers (right).

(iv) The enzyme is unique at the H level of the CATH code (11), unless the homologue already in MACiE has a significantly different chemical mechanism.

Using the above criteria MACiE was expanded from 100 entries in version 1 to a total of 202 entries, which span 199 EC numbers (version 1 spanned 96 EC numbers) and cover a total of 862 reaction steps. There are almost 4000 EC numbers defined, but the number of different reaction mechanisms needed to bring about all these overall transformations is not clear. For example, the serine protease family of proteins has many different substrates, but the mechanisms are broadly similar. In contrast the β-lactamase enzymes, which have the same EC number, have four completely different mechanisms. Within the EC code, the fourth digit usually defines the substrate specificity, which can be very variable in large enzyme families, but the reaction mechanisms for enzymes with the same first three digits are usually essentially the same. In total there are 224 EC sub-subclasses, with only 181 having known structures (12). Of these MACiE covers 158, i.e. 87%. However, there are probably many more mechanisms that are yet to be defined or discovered.

As can be seen from Figure 1, MACiE covers a good proportion of the EC reaction space, with an average relative difference between the size of corresponding EC classes of 4%, with the transferases having the largest difference. When the coverage with respect to EC code present in the PDB is examined, it can be seen that MACiE again represents the coverage of enzymes with known structures very well, with an average relative difference between the corresponding EC classes in MACiE of 5%.

All entries in MACiE contain overall reaction annotation including the information detailed in Table 1. Each elementary reaction or step within an entry is fully annotated as detailed in Figure 2; this includes comments that have been added by the annotators. An extension of the content from MACiE Version 1 is the addition of inferred return steps. These are explicitly labelled as being inferred in the comment field and are necessary to return the enzyme to a state where it is ready to undergo another round of catalysis.

Table 1. Overall reaction annotation content

Catalysis and reaction specific information:
- Enzyme name (common IUPAB/JCBN name)
- EC code
- Catalytic residues involved
- Cofactors involved
- Reactants and products
- Catalytic domain CATH code
- Catalytic UniProt code
- Bonds involved, formed, cleaved, changed in order
- Reactive centres
- Overall reaction comments

Non-catalysis specific information:
- PDB code
- Non-catalytic domain CATH code
- Non-catalytic UniProt code
- Species name (common and scientific)
- Other database identifiers, e.g. EzCatDB, SFLD, etc.
- Literature references

There is sometimes more than one proposed mechanism that is consistent with the available experimental data. In MACiE, we have attempted not only to choose the best supported mechanism, but also where possible to annotate enzymes with reasonable alternative mechanisms. Unfortunately, in the current release such annotations are only available as comments on the stage or overall reaction, although future releases of MACiE will include full entries for these alternatives.

Further details of the annotation process and a glossary of terms used can be found on the MACiE website (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/documentation/ and http://www.ebi.ac.uk/thornton-srv/databases/MACiE/glossary.html, respectively).

DATABASE STRUCTURE
The challenge with MACiE has been to capture and usefully represent all the different catalytic steps that occur during the course of an enzymatic reaction. These reactions may consist of any number of steps, and in MACiE we have reactions ranging from 1 step to 16 steps. The representation of these reactions has evolved from a flat file entered in a commercially available chemical database program (ISIS/Base) to the highly structured and powerful CMLReact (7), which is an application of XML (the eXtensible Markup Language). The final step in this evolution has been the conversion of the CMLReact into the relational database format of MySQL.

CMLReact has a hierarchical structure, facilitating its conversion into the relational database format of MySQL. The conversion relies on the CML Schema and requires the MACiE entries to be consistent with the Schema, which adds an internal consistency check into our authoring process.
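The paper continues on the next page with the mapping rule used for this conversion: each tag type becomes a table, each tag a row, each attribute a column, and extra columns record which row of which table is the parent. A minimal sketch of that rule follows. It makes several assumptions: SQLite (from the Python standard library) stands in for MySQL, the XML snippet and its element names are invented for illustration and are not taken from the CML Schema, and the `load` function and the `attr_` column prefix are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative XML only; element and attribute names are NOT from the CML Schema.
DOC = """
<reaction id="r1">
  <step id="s1" order="1">
    <molecule id="m1" role="reactant"/>
    <molecule id="m2" role="product"/>
  </step>
</reaction>
"""


def load(root, conn):
    """Store every element as a row in a table named after its tag."""
    # First pass: collect the attribute names used by each tag type,
    # then create one table per tag type with one column per attribute.
    columns = {}
    for elem in root.iter():
        columns.setdefault(elem.tag, set()).update(elem.attrib)
    for tag, attrs in columns.items():
        cols = ", ".join(f'"attr_{a}" TEXT' for a in sorted(attrs))
        extra = f", {cols}" if cols else ""
        conn.execute(
            f'CREATE TABLE "{tag}" (row_id INTEGER PRIMARY KEY, '
            f"parent_table TEXT, parent_row INTEGER{extra})"
        )

    # Second pass: insert rows depth-first, recording each row's parent
    # so that the tree structure survives in the relational form.
    def insert(elem, parent_table, parent_row):
        names = ["parent_table", "parent_row"] + [f'"attr_{a}"' for a in elem.attrib]
        values = [parent_table, parent_row] + list(elem.attrib.values())
        marks = ", ".join("?" for _ in values)
        cur = conn.execute(
            f'INSERT INTO "{elem.tag}" ({", ".join(names)}) VALUES ({marks})',
            values,
        )
        for child in elem:
            insert(child, elem.tag, cur.lastrowid)

    insert(root, None, None)


conn = sqlite3.connect(":memory:")
load(ET.fromstring(DOC), conn)
rows = conn.execute(
    "SELECT attr_id, attr_role, parent_table, parent_row "
    "FROM molecule ORDER BY row_id"
).fetchall()
print(rows)
```

Because table and column names are derived from the document itself, a loader like this only works when every entry follows one agreed vocabulary, which is consistent with the paper's remark that the conversion doubles as a consistency check.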
  • 45. Figure 2. An example of the annotation found in a MACiE entry. Reaction shown corresponds to fructose-bisphosphate aldolase (entry 52).

Table 2. Searches available in MACiE

Basic:
- MACiE entry identifier
- Current EC codes
- Obsolete EC codes
- Catalytic Domain CATH codes
- All CATH codes
- PDB code
- Enzyme name
- Catalytic Domain UniProt Codes
- All UniProt Codes

Complex:
- Species name (overall annotation)
- Overall reactants and products
- Reaction comments (overall reactions and steps)
- Amino acid residues (up to six residues)
- Step mechanisms and/or mechanism components (single and combinations of)
- Chemical changes
- Chemical changes with mechanism or mechanism components
- Chemical changes with amino acid residues
- Amino acid residues with mechanism or mechanism components
- Chemical changes with amino acid residues and mechanisms or mechanism components
- Alternative mechanisms

Figure 3. EC code search heuristics.

Each CML tag-type becomes a MySQL table; each tag becomes a row in that MySQL table; each attribute of that tag corresponds to a column in the MySQL table. The tree structure of the CML is preserved in the MySQL version; for each row of each table, there are columns specifying which row of which other table corresponds to the row’s parent tag in the CML version.

The CML version of MACiE, which is the official archive version, is available from the website as individual entries, and the new website uses the relational version of MACiE to perform the online analysis and searching.

DATABASE FEATURES
The original release of MACiE contained static images and annotation for the overall reaction and each step associated with the mechanism; it also included an animated reaction mechanism for approximately half the reactions then in MACiE. Links to various related resources, such as the RCSB PDB (13), IUBMB nomenclature database, CATH, EzCatDB, PDBSum (14), BRENDA, the Catalytic Site Atlas (15), KEGG and the Enzyme Structures Database, were also included. This new release extends these links to include the Macromolecular Structures Database (MSD) (16), SFLD, UniProt (17), and replaces the IUBMB nomenclature database links with links to IntEnz. The new features in MACiE are detailed in the following sections.

Searching MACiE
There are two levels of search implemented in MACiE. The basic level searches are implemented from the main page (http://www.ebi.ac.uk/thornton-srv/databases/MACiE) and are
  • 46. mainly for accessing the entries from the top level, i.e. for searching entries in MACiE by EC code, enzyme name, etc. The complex searches are all available from the query pages of MACiE (http://www.ebi.ac.uk/thornton-srv/databases/MACiE/queryMACiE.html) and are mainly for searching for specific mechanisms, mechanism components or residues and their functions in the reaction steps, although there are some overall reaction searches implemented as well. Table 2 lists the searches available in MACiE and the Supplementary Data contain a detailed listing of the searches available.

The following sections describe searching by EC code, PDB code or enzyme name, all of which use heuristics to extend the coverage of MACiE.

EC code. The EC code search implemented in MACiE is detailed in Figure 3 and can be accessed at any point in the scheme shown. The search for current EC numbers will always walk up the EC code tree until it finds a match, no matter at what level the search is entered. Thus the search will always return a result. As the EC code of enzymes may change over time, a search for obsolete EC codes has also been implemented, although this search will not always return a result. However, it should be noted that the higher up the EC hierarchy the search has gone, the less likely it is that the returned mechanism will be a match to the query. The obsolete EC code search works in the same way as the current EC code search.

If no matches are found at the serial number level of the EC code, an advanced search option will allow the user to search for a structural homologue of an enzyme with a given EC code, which is shown in Figure 4 and described below. This advanced search option takes the entered EC code and finds the PDB codes of all of the matches to that EC code in the Catalytic Site Atlas (CSA). A homology search is then performed on those PDB codes for a match in MACiE. This homology search is described in more detail in the following section.

The CSA is a database of catalytic residues in proteins of known structure. It contains much less mechanistic information than MACiE, but has a considerably wider coverage of protein structures than MACiE does. This wider coverage is partly because the CSA contains not only manually annotated entries, but also entries that are automatically annotated based on sequence alignment to the manual entries.

PDB code. There are over 19 000 crystal structures relating to enzymes deposited in the PDB. As MACiE entries require extensive literature searching and analysis, only a small fraction of these PDB entries are covered explicitly, 202 in total. However, we have used the CSA to identify homologues of these enzymes, extending this coverage to 7528 PDB codes. Figure 5 details the search performed in MACiE when a protein structure described by a PDB code is entered.

Although the entries returned by this search will be homologues, this does not guarantee that the mechanism and the catalytic residue assignments are the same. This is because the homology method (see below) can retrieve very distant relatives. Owing to this limitation, all homologous entries are compared by EC code, and when there is a divergence between the MACiE entry and the homologue at the serial number level, this is clearly indicated to the user. We also

Figure 4. Advanced EC search heuristics.
Figure 5. PDB search heuristics.
Figure 6. Enzyme name search heuristics.
  • 47. list the amino acid residues that are annotated as catalytic in both MACiE and the CSA. Thus it is clear if there is any difference between EC numbers and catalytic residues. If the EC number differs but the catalytic residues between query and homologue are of identical types, it can be inferred that the mechanisms are likely to be the same, but where both differ, the mechanisms are unlikely to be transferable. From the results page we link both to the MACiE entry and the CSA entry.

Homology in MACiE. We have been working to bring MACiE and the CSA closer together. This includes using the CSA to determine homologues (those enzymes which are evolutionarily related) of entries in MACiE. The CSA finds homologues using a PSI-BLAST search (with an E-value cut-off of 0.0005 and five iterations) against all sequences currently in the PDB, plus all sequences in a non-redundant subset of UniProt. The UniProt sequences are included purely in order to increase the range of the PSI-BLAST search by bridging gaps between distantly related sequences in the PDB; only sequences occurring in the PDB are retrieved for entry into the CSA.

In the CSA, and thus MACiE, homologous entries are only included if the residues which align with the catalytic residues in the parent literature entry are identical in residue type. In other words, there must be no mutations at the catalytic residue positions. There are, however, a few exceptions to this rule:

(i) In order to allow for the many active site mutants in the PDB, one (and only one) catalytic residue per site can be different in type from the equivalent in the parent literature entry. This is only permissible if all residue spacing is identical to that in the parent literature entry, and there are at least two catalytic residues.
(ii) Sites with only one catalytic residue are permitted to be mutant provided that the residue number is identical to that in the parent entry.
(iii) Fuzzy matching of residues is permitted within the following groups: [V,L,I], [F,W,Y], [S,T], [D,E], [K,R], [D,N], [E,Q], [N,Q]. This fuzzy matching cannot be used in combination with rules (i) or (ii) above.

Figure 7. Growth of MACiE. This shows the growth in the number of EC codes (blue), EC sub-sub classes (cyan) and catalytic domain CATH codes (red) in MACiE.

Figure 8. Frequency distribution of amino acid residues. This shows the frequency of catalytic amino acid residues in MACiE (blue), versus the frequency of residues in MACiE (cyan), versus the frequency of residues in the wwPDB (red). The frequency of catalytic amino acid residues in MACiE is calculated by taking the number of residues (of a given type) annotated in MACiE divided by the total number of annotated residues in MACiE, multiplied by 100.
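The inclusion rule for homologues and its three exceptions can be read as a small decision procedure. The sketch below is one possible reading, not the CSA's actual code. It assumes that a catalytic site is given as an ordered list of (one-letter residue type, residue number) pairs already aligned position by position, and that "residue spacing" means the gaps between successive residue numbers. The function names are hypothetical.

```python
# Fuzzy groups as listed in exception (iii).
FUZZY_GROUPS = [set("VLI"), set("FWY"), set("ST"), set("DE"),
                set("KR"), set("DN"), set("EQ"), set("NQ")]


def fuzzy_equal(a, b):
    """Identical residue types, or both members of one fuzzy group."""
    return a == b or any(a in group and b in group for group in FUZZY_GROUPS)


def spacing(site):
    """Gaps between successive residue numbers (assumed meaning of 'spacing')."""
    numbers = [number for _, number in site]
    return [b - a for a, b in zip(numbers, numbers[1:])]


def accept_homologue(parent, candidate):
    """Decide whether an aligned candidate site passes the inclusion rules."""
    if len(parent) != len(candidate):
        return False
    mismatches = sum(p[0] != c[0] for p, c in zip(parent, candidate))
    if mismatches == 0:
        return True  # base rule: identical residue types
    # Exception (iii): every difference is a fuzzy match; not combined with (i)/(ii).
    if all(fuzzy_equal(p[0], c[0]) for p, c in zip(parent, candidate)):
        return True
    # Exception (ii): single-residue site, mutant allowed if the number is identical.
    if len(parent) == 1:
        return parent[0][1] == candidate[0][1]
    # Exception (i): exactly one mutant residue and identical spacing.
    return mismatches == 1 and spacing(parent) == spacing(candidate)


parent_site = [("S", 70), ("K", 73), ("E", 166)]
print(accept_homologue(parent_site, [("S", 64), ("K", 67), ("E", 160)]))
print(accept_homologue(parent_site, [("A", 70), ("K", 73), ("E", 166)]))
print(accept_homologue(parent_site, [("A", 70), ("R", 73), ("E", 166)]))
```

The third call combines a true mutation with a fuzzy substitution, which the rules forbid, so under this reading it is rejected.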
  • 48. Enzyme name. This is currently implemented as a partial string match, thus entering ‘beta’ will return all the β-lactamases and betaine-aldehyde dehydrogenase. If no results are returned from the partial name search, then the name search heuristics (shown in Figure 6) are implemented. This search utilizes the IntEnz database (4). MACiE searches for a name in IntEnz, either a synonym, alternative name or common name, and returns the EC code of that name. The EC code is then used to search MACiE. If no matches are found to the sub-subclass level of the EC code, the user is offered an advanced EC code search (see Figure 4).

Statistics
The other major development in MACiE has been the inclusion of database statistics that are all generated on the fly from the SQL tables. A full listing of the statistics available can be found in the Supplementary Data. The growth of MACiE is shown in Figure 7 in terms of EC coverage and CATH coverage.

The statistics in MACiE can also be used to examine the function and distribution of amino acid residues (G.L. Holliday, D.E. Almonacid, J.M. Thornton and J.B.O. Mitchell, manuscript in preparation) (see Figure 8), the distribution of mechanism and mechanism components and the bond order changes occurring in each step of the reaction.

FUTURE DEVELOPMENTS
MACiE is a continually developing resource, and in the future we hope to include 3D data, which will incorporate various statistics and searches related to the analysis of these data. We will also continue to extend the coverage of MACiE to include alternative reaction mechanisms that have been suggested for various enzymes, as well as new mechanisms. Finally, we intend to build a user interface which will allow chemical diagrams to be drawn and used to search MACiE, to provide an entry process which is more usable, and to implement the classification of enzyme mechanisms that we are developing.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS
We would like to thank the EPSRC (G.L.H. and J.B.O.M.), BBSRC (G.J.B. and J.M.T.: CASE studentship in association with Roche Products Ltd; N.M.O.B. and J.B.O.M.: grant BB/C51320X/1), the Wellcome Trust, EMBL, IBM (G.L.H. and J.M.T.), the Chilean Government’s Ministerio de Planificación y Cooperación and the Cambridge Overseas Trust (D.E.A.) for funding and Unilever for supporting the Centre for Molecular Science Informatics. J.W.T. is funded by a European Molecular Biology Laboratory studentship, and is also affiliated with Cambridge University Department of Chemistry. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.

Conflict of interest statement. None declared.

REFERENCES
1. Schomburg,I., Chang,A., Ebeling,C., Gremse,M., Heldt,C., Huhn,G. and Schomburg,D. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431–D433.
2. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
3. IUBMB (2005) Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzyme-catalysed reactions.
4. Fleischmann,A., Darsow,M., Degtyarenko,K., Fleischmann,W., Boyce,S., Axelsen,K., Bairoch,A., Schomburg,D., Tipton,K.F. and Apweiler,R. (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res., 32, D434–D437.
5. Holliday,G.L., Bartlett,G.J., Almonacid,D.E., O’Boyle,N.M., Murray-Rust,P., Thornton,J.M. and Mitchell,J.B.O. (2005) MACiE: a database of enzyme reaction mechanisms. Bioinformatics, 21, 4315–4316.
6. Holliday,G.L., Mitchell,J.B.O. and Murray-Rust,P. (2004) CMLSnap: animated reaction mechanisms. Internet J. Chem., 7, Article 4.
7. Holliday,G.L., Murray-Rust,P. and Rzepa,H.S. (2006) Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML vocabulary for chemical reactions. J. Chem. Inf. Model., 46, 145–157.
8. Pegg,S.C.-H., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C., Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C. (2006) Leveraging enzyme structure–function relationships for functional inference and experimental design: the Structure–Function Linkage Database. Biochemistry, 45, 2545–2555.
9. Nagano,N. (2005) EzCatDB: the Enzyme Catalytic-mechanism DataBase. Nucleic Acids Res., 33, D407–D412.
10. Berman,H.M., Henrick,K. and Nakamura,H. (2003) Announcing the worldwide Protein Data Bank. Nature Struct. Biol., 10, 980.
11. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
12. Martin,A.C. (2004) PDBSprotEC: a Web-accessible database linking PDB chains to EC numbers via SwissProt. Bioinformatics, 20, 986–988.
13. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
14. Laskowski,R.A., Chistyakov,V.V. and Thornton,J.M. (2005) PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res., 33, D266–D268.
15. Porter,C.T., Bartlett,G.J. and Thornton,J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32, D129–D133.
16. Golovin,A., Oldfield,T.J., Tate,J.G., Velankar,S., Barton,G.J., Boutselakis,H., Dimitropoulos,D., Fillon,J., Hussain,A., Ionides,J.M. et al. (2004) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res., 32, D211–D216.
17. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
  • 51. Vol. 22 no. 20 2006, pages 2565–2566
BIOINFORMATICS APPLICATIONS NOTE
doi:10.1093/bioinformatics/btl416
Data and text mining

PYCHEM: a multivariate analysis package for python

Roger M. Jarvis1,4,*, David Broadhurst1,4, Helen Johnson2, Noel M. O’Boyle3 and Royston Goodacre1,4

1 School of Chemistry, The University of Manchester, PO Box 88, Sackville Street, Manchester M60 1QD, UK, 2 Faculty of Life Sciences, University of Manchester, Stopford Building, Oxford Road, Manchester M13 9PT, UK, 3 Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW, UK and 4 Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK

Received on April 4, 2006; revised on July 5, 2006; accepted on July 26, 2006
Advance Access publication July 31, 2006
Associate Editor: Martin Bishop

ABSTRACT

Summary: We have implemented a multivariate statistical analysis toolbox, with an optional standalone graphical user interface (GUI), using the Python scripting language. This is a free and open source project that addresses the need for a multivariate analysis toolbox in Python. Although the functionality provided does not cover the full range of multivariate tools that are available, it has a broad complement of methods that are widely used in the biological sciences. In contrast to tools like MATLAB, PyChem 2.0.0 is easily accessible and free, allows for rapid extension using a range of Python modules and is part of the growing amount of complementary and interoperable scientific software in Python based upon SciPy. One of the attractions of PyChem is that it is an open source project and so there is an opportunity, through collaboration, to increase the scope of the software and to continually evolve a user-friendly platform that has applicability across a wide range of analytical and post-genomic disciplines.

Availability: http://sourceforge.net/projects/pychem

Contact: Roger.Jarvis@manchester.ac.uk or admin@pychem.org.uk

Supplementary information: Further information is available from the project home page at http://pychem.sf.net/ whilst details of data generation are available at http://biospec.net/

1 INTRODUCTION

Increasingly in the life sciences many experiments generate data which are of a multivariate nature, where many observations are recorded for each sample under analysis. Interpretation of such complex data cannot generally be performed by taking a univariate approach, since no single measurement is necessarily adequate enough to describe the problem being addressed. In fact, the application of univariate methodology is in many cases totally inappropriate as the complexity of information contained within large biological datasets reflects the complexity of the system(s) being studied. Typical multivariate analysis problems involve unsupervised learning such as factor analysis, for reducing the dimensionality of data and modeling of variance; linear regression, for formulating input to output transformation models based on supervised learning which are predictive generally for quantitative trait(s); and discriminant analysis, for distinguishing between different sample groups and for subsequent predictions on new samples. In fact, multivariate analysis encompasses many more methods than these examples of linear modeling imply (Brereton, 2003); but these tools are perhaps those most commonly used for the modeling of biological data.

Many programs currently exist for multivariate analysis. Flexible environments for mathematical computing are available in the form of MATLAB (The Mathworks, Natick, MA, USA), GNU Octave (http://www.octave.org/) which aims to be a free equivalent of MATLAB, and R (http://www.r-project.org/), which has many other bio-analysis modules, such as Vegan (for environmetrics) and Bioconductor (for genomic analysis). These products provide powerful tools for multivariate analysis through command line interpreters, which allow the user to perform their analysis with a great degree of flexibility. However, they require some investment in time to become familiar with the interpreter's syntax, and are not necessarily straightforward for people with little computational experience. In addition, a number of graphical multivariate software tools are also available; Evince (UmBio, Umeå, Sweden), The Unscrambler (CAMO, Woodbridge, NJ, USA), Pirouette (Infometrix, Bothell, WA, USA), S-Plus (Insightful, Seattle, WA, USA) and SIMCA (Umetrics, Umeå, Sweden) are all good tools for basic multivariate analysis although, with the exception of S-Plus, they lack the flexibility of the interpreter style interfaces. Thus there is currently a requirement for a flexible, extensible, free and open source graphical environment for performing multivariate analysis, which can be used by both experts and casual users. The increasing popularity of scripting languages such as Python (http://www.python.org/) within the life sciences community offers the technology and critical mass for such a project. A platform of this type addresses the requirements outlined above, with the additional benefit that it allows for the rapid development of new cross-platform software approaches, and the integration of currently available software libraries through application programming interfaces (APIs).

2 THE MULTIVARIATE ANALYSIS TOOLBOX FOR PYTHON

The PyChem project aims to provide a simple multivariate analysis toolbox with a powerful and intuitive GUI front-end.

* To whom correspondence should be addressed.

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Bioinformatics. 2006, 22, 2565-2566.
  • 52. R.M.Jarvis et al.

The project is implemented in Python and utilizes the wxPython (http://www.wxpython.org/), Boa Constructor (http://boa-constructor.sourceforge.net/) and SciPy (http://scipy.org/) packages (see Fig. 1 for an example screenshot) amongst others. The software was designed to provide a range of algorithms that address three fundamental questions commonly asked by the researcher.

(1) What is the shape of the data, including sources of variance and outlier identification?
(2) How similar are different samples?
(3) Which measurements from the original data can be attributed to observed differences and/or similarities?

To help answer these questions, the initial release includes algorithms for the pre-processing of multivariate data (such as scaling, baseline correction, filtering and derivatization), principal components analysis (PCA) (Jolliffe, 1986), partial least squares regression (PLS1) (Martens and Naes, 1989), discriminant function analysis (DFA) (Manly, 1994), cluster analysis [using the C clustering library for Python (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/) (Eisen et al., 1998; de Hoon et al., 2004)], and a number of genetic algorithm (GA) based tools for performing feature selection (Jarvis and Goodacre, 2005), see Fig. 1.

Fig. 1. A screenshot demonstrating the feature selection functionality available in PyChem. In this example, microarray data (Golub et al., 1999) consisting of 72 samples represented by 7070 genes have been analysed. The GA directed search can be used to highlight genes that are particularly important for discrimination.

The software is able to handle any 2D dataset where each sample is defined by a series of discrete or continuous measurements. Data can be imported from flat ASCII files that use the standard delimiters. Typical data of this type include those generated from microarrays, proteomics, spectroscopic methods (UV-Vis, infrared and Raman), mass spectrometry, NMR, or indeed any data arrays representing samples for which multiple discrete measurements have been acquired. Once data have been imported into PyChem they can be saved in an XML format [implemented using cElementTree (http://effbot.org/)] as a PyChem experiment, which allows for the subsequent storage of multiple experimental results within a single file. This allows for the capture of the state of the system at a point in time, so that results of multivariate analyses can be stored and the progression of the analysis recorded, which is particularly useful for tracking data analyses as part of GLP. The additional benefit of using the XML data structure for storage is that it introduces the potential for engineering simple bespoke interfaces to database storage systems.

PyChem provides simple grid-style user interfaces for the input of experimental and sample metadata, so producing a series of vectors describing the origin and identity of each sample and measured variable. For unsupervised analyses, such as PCA, the software simply requires a vector, or multiple vectors, of sample labels for plotting; in addition, for supervised analyses, vectors are required to (1) represent putative class structures or some quantitative trait (e.g., level of abiotic or biotic interference) and (2) identify groups into which the data should be split for the purpose of cross-validation. In supervised analyses the issue of model validation is crucial; when a model is formulated there is a possibility that it will overfit the data and find a relationship between the data and the target class structure or dependent variables which does not hold for subsequent predictions; i.e. the model has learnt the training data perfectly and is not able to generalize. This situation can be avoided by performing some form of model validation. In the current version of PyChem (2.0.0) we use the preferred approach of data splitting (Brereton, 2003), which works by dividing the measured X-variables into three groups: a model training set, model cross-validation data and finally an independent test set. The model is trained on the first set, optimized on the second set and then tested for accuracy on the third set of ‘hold-out’ data.

A major emphasis of this work has been in providing clear and useful graphical reports for the interpretation of results. The GUI uses wxPyPlot (http://www.cyberus.ca/~g_will/wxPython/wxpyplot.html), with a small modification to include text plotting. In the future even more focus will be given to the structure of graphical reporting in PyChem, as well as the functionality associated with the plotting canvases. Finally, all results, both graphical and numerical, can easily be exported from PyChem, with numerical results in ASCII file format to allow for use in other software applications.

ACKNOWLEDGEMENTS

R.M.J., D.B., H.J., N.M.O.B. and R.G. would like to thank the BBSRC for funding (NMOB; grant BB/C51320X/1). Funding to pay the Open Access publication charges for this article was provided by the BBSRC.

Conflict of Interest: none declared.

REFERENCES

Brereton,R. (2003) Chemometrics: data analysis for the laboratory and chemical plant, 1st edn. Chichester: John Wiley & Sons Ltd.
Eisen,M. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
de Hoon,M. et al. (2004) Open source clustering software. Bioinformatics, 20, 1453–1454.
Golub,T. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Jarvis,R. and Goodacre,R. (2005) Genetic algorithm optimization for pre-processing and variable selection of spectroscopic data. Bioinformatics, 21, 860–868.
Jolliffe,I.T. (1986) Principal Component Analysis. Springer-Verlag, New York.
Manly,B.F.J. (1994) Multivariate Statistical Methods: A Primer. Chapman & Hall/CRC, New York.
Martens,H. and Naes,T. (1989) Multivariate Calibration. John Wiley & Sons, Chichester.
  • 53. Chemistry Central Journal
Methodology, Open Access

Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

Noel M O'Boyle*1,2, David S Palmer1,3, Florian Nigsch1 and John BO Mitchell1

Address: 1Unilever Centre for Molecular Science Informatics, Dept. of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW, UK, 2Cambridge Crystallographic Data Centre, 12 Union Rd, Cambridge, CB2 1EZ, UK and 3Department of Chemistry, Aarhus University, 8000 Aarhus C, Denmark

Email: Noel M O'Boyle* - baoilleach@gmail.com; David S Palmer - dsp@chem.au.dk; Florian Nigsch - fn211@cam.ac.uk; John BO Mitchell - jbom1@cam.ac.uk
* Corresponding author

Received: 1 August 2008
Accepted: 29 October 2008
Published: 29 October 2008

Chemistry Central Journal 2008, 2:21 doi:10.1186/1752-153X-2-21
This article is available from: http://journal.chemistrycentral.com/content/2/1/21

© 2007 O'Boyle et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024–1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581–590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset.

Results: Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R² of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C and R² of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R² of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5°C, R² of 0.55). However it is much less prone to bias at the extremes of the range of melting points as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest.

Conclusion: With a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors.

Chem. Cent. J. 2008, 2, 21.
  • 54. Background

Quantitative Structure-Activity and Structure-Property Relationship (QSAR and QSPR) models are based upon the idea, first proposed by Hansch [1], that a molecular property can be related to physicochemical descriptors of the molecule. A QSAR model for prediction must be able to generalise well to give accurate predictions on unseen test data. Although it is true in general that the more descriptors used to build a model, the better the model predicts the training set data, such a model typically has very poor predictive ability when presented with unseen test data, a phenomenon known as overfitting [2]. Feature selection refers to the problem of selecting a subset of the descriptors which can be used to build a model with optimal predictive ability [3]. In addition to better prediction, the identification of relevant descriptors can give insight into the factors affecting the property of interest.

The number of subsets of a set of n descriptors is 2^n − 1. Unless n is small (< 20) it is not feasible to test every possible subset, and the number of descriptors calculated by cheminformatics software is usually much larger (CDK [4], MOE [5] and Sybyl [6] can respectively calculate a total of 95, 146 and 248 1D and 2D descriptors). Feature selection methods can be divided into two main classes: the filter approach and the wrapper approach [3,7,8]. The filter approach does not take into account the particular model being used for prediction, but rather attempts to determine a priori which descriptors are likely to contain useful information. Examples of this approach include ranking descriptors by their correlation with the target value or by estimates of the mutual information (based on information theory) between each descriptor and the response. Another commonly used filter in QSAR is the removal of highly correlated (or anti-correlated) descriptors [9]. Liu [10] presents a comparison of five different filters in the context of prediction of binding affinities to thrombin. The filter approach has the advantages of speed and simplicity, but the disadvantage that it does not explicitly consider the performance of the model containing different features. Correlation criteria can only detect linear dependencies between descriptor values and the response, but the best performing QSAR models are often non-linear (support vector machines (SVM), neural networks (NN) and random forests (RF), for example). In addition, Guyon and Elisseeff show that very high correlation (or anti-correlation) does not necessarily imply an absence of feature complementarity, and also that two variables that are useless by themselves can be useful together [3].

The wrapper approach conducts a search for a good feature selection using the induction algorithm as a black box to evaluate subsets and calculate the value of an objective function. The objective function should provide an estimate of how well the model will generalise to unseen data drawn from the same distribution. The purpose of the search is to find the feature selection that optimises this value. The most well-known deterministic wrapper is sequential forward selection [11] (SFS) which involves successive additions of the feature that most improves the objective function to the subset of descriptors already chosen. A related algorithm, sequential backwards elimination [12] (SBE), successively eliminates descriptors starting from the complete set of descriptors. Both of these algorithms suffer from the problem of 'nesting'. In the case of SFS, nesting refers to the fact that once a particular feature is added it cannot be removed at a later stage, even if this would increase the value of the objective function. More sophisticated methods, such as the sequential forward floating selection (SFFS) algorithm of Pudil et al. [13], include a backtracking phase after each addition where variables are successively eliminated if this improves the objective function. Wrapper methods specific to certain models have also been developed. For example, the Recursive Feature Elimination algorithm of Guyon et al. [14] and the Incremental Regularised Risk Minimisation of Fröhlich et al. [15] are specific to models built using support vector machines.

Stochastic wrappers attempt to deal with the size of the search space by incorporating some degree of randomness into the search strategy. The most well known of these algorithms is the genetic algorithm [16] (GA), whose search procedure mimics the biological process of evolution. A number of models are created randomly in the first generation, the best of which (as measured by the objective function) are selected and interbred in some way to create the next generation. A mutation operator is applied to the new models so that random sampling of the local space occurs. Over the course of many generations, the objective function is optimised. Genetic algorithms were first used for feature selection in QSAR by Rogers and Hopfinger [17] and are now used widely [9,18,19]. Other stochastic methods which have been used for feature selection in QSAR are particle swarm optimisation [20,21] and simulated annealing [22].

An additional difficulty in the development of QSAR models is the fact that some regression methods have parameters that need to be optimised to obtain the best performance for a particular problem. The Support Vector Machine (SVM) is an example of such a method. A SVM is a kernel-based machine learning method used for both classification and regression [23-25] which has shown very good performance in QSAR studies [9]. In ε-SVM regression, the algorithm finds a hyperplane in a transformed space of the inputs that has at most ε deviation from the output y values. Deviations greater than ε are penalised by multiplying by a cost value C. The transformation of the inputs is carried out by means of kernel functions, which allows nonlinear relationships between the inputs and the outputs to be handled by this essentially linear method. For a particular problem and kernel, the values of C and ε must be tuned.

Here we describe WAAC, Winnowing Artificial Ant Colony, a stochastic wrapper for feature selection and parameter optimisation that combines simultaneous optimisation of the selected descriptors and the model parameters to create a model with good predictive accuracy. This method does not require any pre-processing of the data apart from removal of zero-variance and duplicate descriptors. The only requirement is that allowed values of parameters of the models must be specified. As a result, this method is suitable for use as an automatic generator of predictive models.

The WAAC algorithm is a novel stochastic wrapper derived from the modified Ant Colony Optimisation (ACO) algorithm of Shen et al. [26]. Ant colony algorithms take their inspiration from the foraging of ants whose cooperative behaviour enables the shortest path between nest and food to be found [27]. Ants deposit a substance called pheromone as they walk, thus forming a pheromone trail. At a branching point, an ant is more likely to choose the trail with the greater amount of pheromone. Over time as pheromones evaporate, only those trails that have been reinforced by the passage of many ants will retain appreciable amounts of pheromone, with the shortest trail having the greatest amount of pheromone. In the end, all of the ants will travel by the shortest trail. Artificial ant colony systems may be used to solve combinatorial optimisation problems by making use of the ideas of cooperation between autonomous agents through global knowledge and positive feedback that are observed in real ant colonies [28].

The first use of artificial ant systems for variable selection in QSAR was the ANTSELECT algorithm of Izrailev and Agrafiotis [29]. The ANTSELECT algorithm involves the movement of a single ant through feature space. Initially equal weights are assigned to each descriptor. The probability of the ant choosing a particular descriptor in the next iteration is the weight for that descriptor divided by the sum of all weights. After the fitness of the model is assessed, all of the weights are reduced by multiplying by (1 − ρ), where ρ is the evaporation coefficient. The weights of those descriptors selected in the current iteration are then increased by a constant multiple of the fitness score. Gunturi et al. [30] used a modification of the ANTSELECT algorithm in a recent study of human serum albumin binding affinity in which the number of features selected was fixed a priori and, in addition, could not include descriptors that had a correlation coefficient greater than 0.75.

Since the ANTSELECT algorithm uses only a single ant, it cannot make use of one of the most important features of ant colony algorithms, collective intelligence. Instead, premature convergence will occur due to positive reinforcement of models that have performed well earlier in the local search. In addition, the search space will be poorly covered. Although the authors recommend that the algorithm should be repeated several times to minimise the likelihood of convergence to a poor local minimum, the use of an ant colony is a much more robust solution.

Shen et al. [26] presented an ACO algorithm that differed from ANTSELECT in several ways. Their algorithm, which they called a modified ACO, is similar to our WAAC algorithm in that it involves a colony of ants, each of which remembers its best model and score, as well as its current model and score. In Shen et al.'s algorithm, for every descriptor there are both positive and negative weights. The probability that an ant will choose a particular descriptor is given by the positive weight for that descriptor divided by the sum of the positive and negative weights. After every iteration, the weights are reduced by multiplying by (1 − ρ) as for ANTSELECT. The positive weight for a particular descriptor is increased by the sum of the fitness scores of all ants in the current iteration that have selected it, as well as the fitness scores of the best models of all ants that have selected it in that model. Similarly, the negative weight for a particular descriptor is decreased by an amount based on the fitness scores of models that have not selected it.

In the following section, we describe the WAAC algorithm in detail, as well as the dataset and model used to test the algorithm. In the Results and Discussion sections, we describe the performance of the WAAC algorithm, compare it to other models on the same dataset, and discuss some practical considerations in usage.

Methods

WAAC algorithm

The WAAC algorithm uses a population of candidate models termed an 'ant colony'. Each ant represents a model; that is, it is associated with a particular feature selection as well as particular values for the model (for example, SVM) parameters. The set of descriptors is stored as a binary fingerprint of length F (the number of descriptors), where a value of 1 for the nth bit indicates that the nth descriptor is selected, and 0 indicates that it is not. For each parameter of the model, a range of discrete values is required. The parameter values used by a particular ant are stored in a list of length P, where P is the number of adjustable parameters of the model. The fitness of each model is measured using an objective function specified by the user.
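The representation of an ant described in the WAAC algorithm section (a binary fingerprint of length F plus a list of P parameter values, each drawn from a discrete set of allowed values) can be illustrated with a minimal Python sketch. This is not the authors' R implementation: the names `Ant`, `make_ant` and `allowed_values`, and the small example values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Ant:
    """One candidate model: a feature selection plus one value per parameter."""
    fingerprint: List[int]  # length F; 1 = descriptor selected, 0 = not selected
    params: List[float]     # length P; one value per adjustable model parameter

    def selected(self) -> List[int]:
        """Return the indices of the descriptors this ant has selected."""
        return [i for i, bit in enumerate(self.fingerprint) if bit == 1]


def make_ant(fingerprint: Sequence[int], params: Sequence[float],
             allowed: Sequence[Sequence[float]]) -> Ant:
    """Build an ant, checking each parameter against its discrete allowed values."""
    if any(bit not in (0, 1) for bit in fingerprint):
        raise ValueError("fingerprint bits must be 0 or 1")
    if len(params) != len(allowed):
        raise ValueError("need exactly one value per adjustable parameter")
    for value, choices in zip(params, allowed):
        if value not in choices:
            raise ValueError(f"{value} is not an allowed parameter value")
    return Ant(list(fingerprint), list(params))


# Hypothetical example: F = 6 descriptors and P = 2 parameters
# (think of an SVM cost C and epsilon, each with a short list of allowed values).
allowed_values = [[1, 3, 5, 7], [0.01, 0.11, 0.21]]
ant = make_ant([1, 0, 0, 1, 1, 0], [5, 0.21], allowed_values)
print(ant.selected())  # [0, 3, 4]
```

Keeping the allowed values outside the ant keeps each ant small, and the check in `make_ant` enforces the requirement that every parameter takes one of its pre-specified discrete values.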
  • 56. Chemistry Central Journal 2008, 2:21 http://journal.chemistrycentral.com/content/2/1/21 The initial population of ants is randomly placed in fea- moving probability is used to determine the chance that a ture and parameter space. The bits of the binary finger- particular ant will select a particular descriptor in the next prints representing the feature selections are initialised to iteration. At the start of the optimisation phase, the mov- either 0 or 1 with equal probability, so that on average ing probabilities for all of the descriptors will be approxi- each ant corresponds to a model based on approximately mately equal to 0.5 (since the best model will be the 50% of the descriptors. Conversely, each descriptor is ini- current model and each descriptor is selected by approxi- tially selected by approximately 50% of the ants. The ini- mately 50% of the ants). tial parameter values for each ant are chosen at random from the available values for each parameter. Similarly, for each parameter there is a moving probabil- ity associated with every allowed value. These moving Figure 1 shows a schematic of the WAAC algorithm. After probabilities sum to unity (since each ant needs to select initialisation, the algorithm enters the optimisation exactly one allowed value for each parameter), and are cal- phase. For each descriptor, a moving probability is calcu- culated by taking the average of the fraction of ants which lated by taking the average of the fraction of ants which have currently selected a particular allowed value and the have currently selected that descriptor and the fraction fraction of ants that have selected that value in their best that have selected that descriptor in their best model. This model. At the start of the optimisation phase, each allowed value of a parameter will be selected by approxi- mately N/P ants where N is the number of ants, and P the number of allowed values. 
At the start of the optimisation phase, the ants move more or less randomly, as the moving probabilities are essen- tially equal for all features and parameter values. How- ever, over the course of the optimisation phase as particular descriptors are found to occur frequently in the best models associated with the ants, due to positive feed- back these descriptors will be more likely to be chosen in subsequent iterations. This global optimisation procedure is combined with local optimisation due to the influence of the current positions of the ants on the moving proba- bilities. Note that the ants do not move about relative to their position in a previous iteration; rather, their subse- quent location in feature space is determined by the best and current feature selections of all of the ants. Note that nesting is not a problem, as in each step of the optimisa- tion the ants are free to explore descriptor combinations which did not exist in the previous step. After multiple iterations of the optimisation algorithm, a winnowing procedure is applied. This reduces the search space by retaining only those descriptors that have been chosen by at least 20% of the ants in their best models, and removing the rest. Parameter values are reinitialised randomly. Some descriptors may be retained that do not improve the models, but the subsequent reinitialisation of the ants on the smaller search space will allow the sub- sequent optimisation phase to identify better models which exclude that descriptor. Note that no information is carried from one optimisation procedure to the next. In particular, memory of previous best models does not guide future searching. This means that the randomly ini- tialised models in the new optimisation phase are always Figure of Outline 1 the WAAC algorithm poorer than the best models of the previous phase, but the Outline of the WAAC algorithm. reduction in the size of the feature space means that the Page 4 of 15 Chem. Cent. J. 2008, 2, 21. 
(page number not for citation purposes)
performance of the model quickly recovers and matches or improves on earlier performance.

As shown in Figure 1, the optimisation phase and winnowing procedure are repeated until convergence is achieved or a specific number of iterations have occurred. The best model found at any point in the entire optimisation procedure should be chosen as the final best model. An implementation of WAAC in R [31] is available from the authors on request.

Dataset
We use the Karthikeyan dataset [32] of melting point values as described in Nigsch et al. [33]. This is a dataset of melting points of 4119 diverse organic molecules which cover a range of melting points from 14 to 392.5°C, with a mean of 167.3°C and a standard deviation of 66.4°C. Each molecule is described by 203 2D and 3D descriptors, which is the full range of descriptors available in the software MOE 2004.03 [5].

The dataset was randomly divided 2:1 into training data and an external test set (1373 molecules, see additional file 1: externaltest.csv for the original data). The training data was further randomly divided 2:1 into a training set used for model building (1831 molecules, see additional file 2: internaltraining.csv) and an internal test set (915 molecules, see additional file 3: internaltest.csv).

Objective function
The goal of the WAAC algorithm is to find the feature subset and parameter values that will give the best predictive accuracy for a model based on given training data. During the course of the optimisation, the algorithm needs to be guided by an objective function that will give an estimate of the predictive accuracy of a particular model.

Here we examine the performance of the WAAC algorithm on the Karthikeyan dataset using as our objective function the root mean squared error of the predictions on the internal test set, RMSE(int). Each model is built on the training set using whatever features and parameter values have been selected, and then used to predict the melting point values for the internal test set.

Statistical testing
To assess the quality of a model, we report three statistics: the squared correlation coefficient, R², the Root-Mean-Square-Error, RMSE, and the bias. These are defined in Equations 1 to 3. A parenthesis nomenclature is used to indicate whether the statistic refers to a model tested on the entire training data (tr) (this includes the internal test set), the internal test set only (int), or the external test set (ext).

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2} \qquad (1) \]

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - \bar{y}^{\mathrm{obs}}\right)^2} \qquad (2) \]

\[ \mathrm{bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right) \qquad (3) \]

In the prediction of the external test set, an outlier is defined as any point with a residual greater than 4 standard deviations from the mean.

Models
We used the WAAC algorithm to simultaneously optimise the chosen features and number of components in a Partial Least Squares (PLS) model. The plsr method in the pls package in R [31] was used to build the PLS model. Scaling was set to true. A range of 20 allowed parameter values for the number of components in the model was initially set to cover from 1 to 191 inclusive in steps of 10. After each winnowing, the step size was reset so that the maximum value for the number of components was less than the number of remaining descriptors. For the WAAC algorithm itself, a colony of 50 ants was used, and the algorithm was run for 800 iterations with winnowing every 100 iterations. For comparison, the algorithm was run for the same length without any winnowing.

In addition, we used the WAAC algorithm to optimise a Support Vector Machine (SVM) model. The svm method in the e1071 package in R [31] was used to perform ε-regression with a radial basis function. A range of allowed parameter values for the SVM were chosen based on a preliminary run: values for C from 1 to 31 inclusive in steps of 2, and values of ε from 0.01 to 1.61 inclusive in steps of 0.1. Since two parameters needed to be optimised for this model, the length of each optimisation phase in the WAAC algorithm was extended to 150 iterations and the algorithm was run for 1500 iterations in total.

To compare to other feature selection methods, we used the training data to build a Random Forest model [34] using the randomForest package in R (using the default settings of mtry = N/3, ntree = 500, nodesize = 5). We also compared to the best of thirteen k Nearest Neighbours (kNN) models trained on the training set, where k was 1, 5, 10 or 15. For the models based on multiple neighbours, separate models were created where the predictions were combined using exponential, geometric, arithmetic, or inverse distance weighting (for more details, see Nigsch et
al. [33]). The best performing model, as measured by leave-one-out cross validation on the training data, was the 15 NN model with exponential weighting. Hereafter, this model is referred to as the kNN model.

Genetic algorithm
For comparison with the WAAC algorithm, a genetic algorithm for feature selection was implemented in the R statistical programming environment [31]. 50 chromosomes were randomly initialised so that each chromosome on average corresponded to a model based on half of the descriptors. A selection operator chose 10 chromosomes using tournament selection with tournaments of size 3. Once selected, a chromosome was removed from the pool for further selection. A crossover operator was applied to the selected chromosomes, as a single-point crossover between randomly selected (with replacement) chromosomes yielding a pair of children in each case. Each child was subject to a mutation operator which, for a given bit on a chromosome, had a probability of 0.04 of flipping it. The process of crossover and mutation was repeated until 50 offspring were created. The next generation was then formed by the 25 best chromosomes in the original population along with the best 25 of the offspring.

Figure 2. Value of the objective function for the best model at each iteration of the WAAC algorithm for the PLS model (top) and the SVM model (bottom). The figures on the right, (b) and (d), show the effect of having a single optimisation phase without any winnowing. Ten repetitions of the algorithm are shown, with corresponding repetitions starting from the same initial random seed.

Results
The WAAC algorithm was used to search parameter and feature space for a predictive model for the Karthikeyan dataset, for both a PLS model and an SVM
model. Figures 2(a) and 2(c) show the progress of the algorithm for the PLS and SVM models respectively, as measured by the value of the objective function for the best model found so far in a particular optimisation phase. Each experiment was performed 10 times with different random seeds. For each repetition, the model with the lowest value of the objective function was chosen from among the best models found in each optimisation phase. Of these ten models, the one with the fewest descriptors was chosen as the single final model. This reduces the possibility of finding by chance a model which had an optimal value of the objective function but poor predictive ability.

The selected models for WAAC/PLS and WAAC/SVM are shown in Table 1. Of the 203 original descriptors, only 68 were selected for the PLS model, and 28 for the SVM model. The final models were evaluated by training on the entire training data of 2746 molecules, and predicting the melting point value of the external test set. The results are shown in Figure 3 and summarised in Table 2. The summary statistics for the PLS model are: for the training set, RMSE(tr) = 44.4°C, R²(tr) = 0.52, bias = -0.0°C; for the test set, RMSE(ext) = 46.6°C, R²(ext) = 0.51, bias = -0.74°C. For comparison, the value of the objective function RMSE(int) was 42.8°C. There was a single outlier, mol4161 (Figure 4). The summary statistics for the SVM model are: for the training set, RMSE(tr) = 30.7°C, R²(tr) = 0.77, bias = -1.6°C; for the test set, RMSE(ext) = 45.1°C, R²(ext) = 0.54, bias = -2.1°C. The value of the objective function RMSE(int) was 40.2°C. Three molecules were identified as outliers to the model: mol41, mol4161 and mol4195. These are drawn as filled circles in Figure 3, and their structures are shown in Figure 4.

Figure 3. Performance of models developed with WAAC: (a) a PLS model and (b) an SVM model. The first two columns contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals plots. All values in °C.
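The summary statistics and the outlier rule reported above can be illustrated with a minimal Python sketch. This is not the authors' R code. The function names are hypothetical, the residual is taken as observed minus predicted (the sign convention is an assumption read from Equation 3), and the sample standard deviation (n - 1 denominator) is assumed for the four-standard-deviation outlier rule.

```python
import math


def summary_statistics(observed, predicted):
    """Return (RMSE, R^2, bias) for paired observed and predicted values."""
    n = len(observed)
    # Residuals as observed minus predicted (assumed sign convention).
    residuals = [obs - pred for obs, pred in zip(observed, predicted)]
    mean_observed = sum(observed) / n
    ss_residual = sum(r ** 2 for r in residuals)
    ss_total = sum((obs - mean_observed) ** 2 for obs in observed)
    rmse = math.sqrt(ss_residual / n)
    r_squared = 1.0 - ss_residual / ss_total
    bias = sum(residuals) / n
    return rmse, r_squared, bias


def find_outliers(observed, predicted, n_sd=4.0):
    """Return indices of points whose residual lies more than n_sd
    standard deviations from the mean residual.

    The sample standard deviation (n - 1) is an assumption; the text
    does not say which estimator was used.
    """
    residuals = [obs - pred for obs, pred in zip(observed, predicted)]
    n = len(residuals)
    mean_residual = sum(residuals) / n
    variance = sum((r - mean_residual) ** 2 for r in residuals) / (n - 1)
    sd = math.sqrt(variance)
    return [i for i, r in enumerate(residuals)
            if abs(r - mean_residual) > n_sd * sd]
```

Applied to the external test set, `summary_statistics` would give the RMSE(ext), R²(ext) and bias values, and `find_outliers` would return the positions of molecules such as those drawn as filled circles in Figure 3.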
Figure 4. Structures of outliers for the models discussed in the text. An outlier is defined as any molecule with a residual greater than four standard deviations from the mean. Molecules 41, 4161 and 4195 are outliers for the WAAC/SVM model; molecules 4161 and 4208 are outliers for both the RF and kNN models; molecule 4161 is the single outlier to the WAAC/PLS model.

For the PLS model the optimised number of components was 49. In order to assess whether the WAAC algorithm sufficiently explored parameter space, we carried out a parameter scan across all allowed values for the parameter with the feature selection found in the best model, and calculated the value of the objective function, RMSE(int). As shown in Figure 5 (solid line), the value of the objective function obtained with 49 components is almost at the minimum, although three larger values for the number of components give slightly better models (42.78°C RMSE(int) versus 42.73°C).

For the SVM model, the optimised parameter values associated with the selected model were a cost value of 5, and a value for ε of 0.21. When we carried out a parameter scan across all allowed values of the cost and ε (272 models in total), only one scored better than the best model, and even then, only marginally: 40.22°C RMSE(int) for cost = 5 and ε = 0.11, versus 40.23°C for the best model.

Figure 2(a) shows the value of the objective function for the best PLS model at each iteration for the WAAC algorithm compared to a single optimisation phase without any winnowing, Figure 2(b). The same random seeds are used for corresponding repetitions of the experiments, to ensure that the effect observed is not due to different initial models. In the absence of winnowing, premature convergence occurs and poorer solutions are found. This is also the case for the best SVM model shown in Figure 2(c) and 2(d).

The Random Forest (RF) and kNN models for the same data are shown in Figure 6 and Table 2. Although performance on the training set does not give any indication of predictive ability, it is interesting to note how the different models have completely different RMSE(tr) and R²(tr). Performance on the external test set, which was not used to derive any of the models, allows us to assess predictive ability. On the basis of RMSE(ext), the RF model (44.5°C) is as good as, or slightly better than, the WAAC/SVM model (45.1°C), followed by the WAAC/PLS model (46.6°C) and then the kNN model (48.3°C). A similar order of predictive ability is shown by R²(ext) (RF: 0.55, WAAC/SVM: 0.54, WAAC/PLS: 0.51, kNN: 0.47). The bias shows a slightly different order for the two WAAC-derived models (RF: -0.4°C, WAAC/PLS: -0.7°C, WAAC/SVM: -2.1°C, kNN: -4.1°C).

However, looking at the test set predictions in the second column of Figures 3 and 6 it is clear, particularly for the RF model, that a systematic error occurs at the extremes of the melting point values in the dataset: low values are systematically overpredicted, while high values are underpredicted. In order to quantify the extent of this problem, we plotted the test set residuals versus the experimental melting point, and used linear regression to find the line of best fit (shown in the third column in Figures 3 and 6). For a model without this type of predictive bias, the expected slope is 0. The WAAC/SVM model performs best with a slope of -0.43, followed by the kNN and WAAC/PLS models which both have slopes of -0.49, while the RF
model has a slope of -0.53. The standard errors of all of these values are 0.01.

Table 1: Description of the best models found by the WAAC algorithm

                         WAAC/PLS            WAAC/SVM
Number of descriptors    68                  28
Parameters               components = 49     Cost = 5, ε = 0.21

2D descriptors (WAAC/PLS): petitjean, weinerPath, weinerPol, a_ICM, b_1rotR, chi0_C, chi1, reactive, a_heavy, a_nH, a_nF, a_nO, a_nS, VadjEq, VadjMa, balabanJ, PEOE_RPC+, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5, PEOE_VSA+6, PEOE_VSA-1, PEOE_VSA-4, PEOE_VSA_FPNEG, PEOE_VSA_PPOS, PC+, PC-, Q_PC+, Q_RPC+, Q_VSA_FHYD, Q_VSA_FNEG, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS, Q_VSA_FPPOS, Q_VSA_PNEG, Q_VSA_PPOS, Kier1, Kier3, KierA1, KierA2, apol, vsa_acc, SlogP_VSA3, SlogP_VSA5, SMR_VSA3, SMR_VSA5, TPSA

2D descriptors (WAAC/SVM): radius, weinerPol, b_1rotR, b_rotR, chi1v_c, a_nO, a_nP, balabanJ, PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA-1, PEOE_VSA-5, PEOE_VSA-6, Q_RPC+, SlogP_VSA1, SlogP_VSA4, SlogP_VSA9, SMR_VSA2, SMR_VSA4, SMR_VSA6, TPSA

3D descriptors (WAAC/PLS): AM1_dipole, AM1_Eele, E_sol, E_strain, E_tor, MNDO_HF, MNDO_dipole, MNDO_E, dipole, PM3_HF, ASA-, ASA_H, CASA-, FASA_H, FASA_P, VSA, glob, std_dim1, std_dim3, vol

3D descriptors (WAAC/SVM): E_oop, E_strain, E_vdw, PM3_LUMO, FASA_P, FCASA+, rgyr

Another effect of this systematic error is that the predicted values are bunched closer around the mean than the experimental values. The mean and standard deviation of the experimental values in the test set are 167.3°C and 66.4°C, respectively. All of the model predictions have a similar mean: 166.5, 165.2, 167.0 and 163.2°C for the WAAC/PLS, WAAC/SVM, RF and kNN models respectively. However, for the RF model the standard deviation of the predicted values is much smaller than that of the other models: 47.1, 51.6, 41.0 and 49.5°C for the WAAC/PLS, WAAC/SVM, RF and kNN models respectively.

Table 2: Summary statistics for the models discussed in the text

                           WAAC/PLS   WAAC/SVM   SVM      kNN      Random Forest
Training set
  RMSE (°C)                44.4       30.7       36.2     47.6     17.8 (44.7)*
  R²                       0.52       0.77       0.68     0.44     0.92 (0.51)*
  bias (°C)                0.0        -1.6       -2.3     -3.4     0.0
Test set
  RMSE (°C)                46.6       45.1       43.9     48.3     44.5
  R²                       0.51       0.54       0.56     0.47     0.55
  bias (°C)                -0.7       -2.1       -2.3     -4.1     -0.4
  mean (°C)                166.5      165.2      165.0    163.2    167.0
  standard deviation (°C)  47.1       51.6       49.3     49.5     41.0
Line of best fit through test set residuals
  Slope                    -0.49      -0.43      -0.44    -0.49    -0.53

* Out-of-bag estimates for RMSE and R² are shown in parentheses.

Another widely used stochastic method for feature selection is a genetic algorithm (GA). Hasegawa et al. [35] were one of the first to use a GA in combination with a PLS
model to perform feature selection. The performance of the GA for feature selection is shown in Figure 7 compared to the WAAC algorithm. For both algorithms, the number of PLS components was fixed at 49. Convergence is much slower for the GA algorithm. In addition, the model with the fewest number of descriptors from 10 repetitions of each algorithm had 95 descriptors in the case of GA/PLS (objective function of 42.6°C) but only 57 for WAAC/PLS (objective function value of 42.3°C).

Figure 5. The effect of the number of components on the predictive ability of a PLS model. The red dashed line is a model based on all of the features, whereas the model represented by the blue solid line is based only on the subset selected by the WAAC algorithm. The best subset line ends at 59 components, as there are only 59 features in this subset. The line for all features is truncated at 174 components as the RMSE rapidly increases after this point.

Discussion
The development of the WAAC algorithm arose from an attempt to overcome the limitations of the modified ACO and ANTSELECT algorithms. Both of these algorithms determine probabilities by summing weights based on fitness scores. However, we observed that as convergence is achieved the fitness scores of the ant models in a particular iteration differ very little from each other. Thus, WAAC uses the fraction of the number of ants that have chosen a particular descriptor rather than a function of the fitness of the ants that have chosen that feature. Another problem with the use of weights is that they increase monotonically over the course of the algorithm whereas the sum of the number of ants has a clear bound. In addition, WAAC uses a value for ρ of 1, that is, complete evaporation. Values less than 1 were found to delay convergence without any corresponding improvement in the result. This makes sense when we consider that the evaporation parameter is supposed to help strike a balance between exploitation of information on previous models (global search) and exploration of local feature space (local search). However, this aspect is already included in Shen et al.'s algorithm and WAAC by the influence of the best models (global search) and current models (local search) on the moving probabilities. As a result of this simpler approach, the moving probabilities now have a meaningful interpretation: the probability of choosing a particular descriptor in the next iteration is equal to the average of the fraction of ants that have chosen that descriptor in their current model and the fraction of ants that have chosen it in their best model.

Since the WAAC algorithm requires a range of allowed parameter values for the model, it is generally worthwhile to do an exploratory run of the algorithm to determine reasonable values. In addition, it is important that the number of allowed values for each parameter is less than the number of ants (preferably much less) to ensure that the parameter space is adequately sampled. An appropriate size for the ant population depends on the number of descriptors and the extent of the interaction between them. Model space will be better sampled if more ants are used, but the calculation time will also increase. However, since the feature-selection space is of size 2^n − 1, where n is the number of descriptors, the exact number of ants is not expected to affect the ability of the algorithm to find solutions. An ant population of between 50 and 100 ants is recommended. For the WAAC/PLS study, the relationship between the population size and the best value of the objective function is shown in Figure 8; there is little improvement beyond 50 ants. The length of the optimisation phase should be sufficient to allow the objective function to start to converge to an optimum value. It is not necessary to allow the optimisation phase to proceed much further, as after this point the descriptors chosen in the best models reinforce themselves and broad sampling of the search space no longer occurs. The winnowing procedure and subsequent reinitialisation on a smaller search space is a more effective way of finding the optimum model.

In the past, the development and comparison of feature selection methods for QSAR have involved the use of a standard dataset first reported in 1990, the Selwood dataset [36] of the activity of 31 antifilarial antimycin analogues, whose structures are represented by 53 calculated physicochemical descriptors. However, comparisons between different algorithms have been hampered by the fact that many of the descriptors are highly-correlated, and in addition, a true test using an external test set is not feasible due to the small number of samples. Advances in computing power mean that it is no longer appropriate to use such a small dataset for the purposes of testing feature
selection algorithms. The Karthikeyan dataset used here is much more representative of the feature selection problems that occur in modern QSAR and QSPR studies.

Figure 6. Performance of (a) a kNN model, and (b) a Random Forest model. The first two columns contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals columns. All values in °C.

PLS models are prone to overfitting. Figure 5 shows a comparison between a PLS model that uses the best subset (as selected by WAAC) and one using all of the descriptors. It is clear that the development of a predictive PLS model requires a variable selection step. Even if the number of components is optimised, performance is significantly poorer if all features are used instead of just the subset selected by the WAAC algorithm. It is also worth noting that PLS is a linear method, whereas SVM is a non-linear method. If the underlying link between descriptor values and the melting point cannot be adequately described by a linear combination of descriptor values, then the performance of the PLS method is likely to suffer. This may explain why, despite containing fewer than half the number of descriptors, the SVM model performed better than the PLS model.

Although the WAAC algorithm is capable of simultaneously optimising the feature selection as well as the parameter values, in some instances it may be preferable to use the WAAC algorithm simply for feature selection and optimise the parameter values separately for each model. This will only be computationally feasible where the model has a small number of parameters which need to be optimised and where the parameter optimisation can be efficiently carried out. For example, the optimal number of components for a PLS model could be determined by internal cross validation. When compared to
the use of a genetic algorithm for optimising the feature selection of a PLS model, the WAAC algorithm performs well, both in terms of faster convergence and in its ability to produce models with fewer descriptors. It should be noted, however, that genetic algorithms have many different implementations as well as several parameters. This result, on a single dataset, cannot therefore be seen as conclusive.

Figure 7. Value of the objective function for the best PLS model at each iteration of (a) a genetic algorithm and (b) the WAAC algorithm. Ten repetitions of each algorithm are shown. The number of PLS components was set to 49.

In comparison to PLS models, the inclusion of a large number of descriptors does not necessarily lead to overfitting for SVM models. Although both Guyon et al. [14] and Fröhlich et al. [15], for example, have developed descriptor selection methods for SVM, an SVM model built on the entire set of descriptors and using the optimized parameters from the WAAC algorithm actually performs slightly better on the external test set. Here, the main effect of the WAAC algorithm is the identification of a minimum subset of descriptors which are the most important for the development of a predictive model. Such a procedure is especially useful when the descriptor values are derived from experimental measurement or require expensive calculation (for example, those derived from QM calculations). It also aids interpretability of the results.

Figure 8. Relationship between the population size and the minimum value of the objective function for the WAAC/PLS model. The value of the objective function is the minimum found from ten repetitions of the algorithm.

Of the 28 descriptors selected by the WAAC/SVM model, three-quarters are 2D descriptors. Of these, many involve the area of the van der Waals surface associated with particular property values. For example the PEOE_VSA+2 descriptor is the van der Waals surface area (VSA) associated with PEOE (Partial Equalisation of Orbital Electronegativity) charges in the range 0.10 to 0.15. Also selected
were descriptors relating to hydrophobic patches on the VSA (SlogP_VSA1, for example), the contribution to molar refractivity (SMR_VSA2, for example) which is related to polarisability, and the polar surface area (TPSA). Since the intermolecular interactions in a crystal lattice are dependent on complementarity between the properties of the VSA of adjacent molecules, the selection of these descriptors seems reasonable. Two descriptors were selected relating to the number of rotatable bonds (b_1rotR and b_rotR). These properties are related to the melting point through their effect on the change in entropy (ΔSfus) associated with the transformation to the solid state. Hydrogen bonds make an important energetic contribution to the formation of the crystal structure. This probably explains the selection of the descriptor for the number of oxygen atoms (a_nO), although strangely the number of nitrogen atoms is not included (it was however included in five out of the ten ant models). Four descriptors were selected by all ten ant models: b_1rotR, SlogP_VSA1, PEOE_VSA-6 and balabanJ. Balaban's J index is a topological index that increases in value as a molecule becomes more branched [37]. It seems possible that increased branching makes packing more difficult, and leads to lower melting points.

The WAAC algorithm appears to be robust to the presence of highly correlated descriptors. Despite the fact that such descriptors were not filtered from the dataset, the selected WAAC/SVM model contains only two pairs of descriptors with an absolute Pearson correlation coefficient greater than 0.8: b_rotR/b_1rotR (0.97) and SMR_VSA2/PEOE_VSA-5 (0.81). If the WAAC algorithm were unable to filter highly correlated descriptors, we would expect to see many more correlations as 16 of the chosen descriptors were highly correlated (absolute value greater than 0.8) with at least one descriptor not included in the final model. For example, radius has a correlation of 0.86 with respect to diameter (not unexpectedly). weinerPol is highly correlated with 35 other descriptors, none of which were chosen in the final model. PM3_LUMO is correlated with both AM1_LUMO (0.97) and MNDO_LUMO (0.96), but neither of the other two appears.

For a small number of molecules, our models make very poor predictions. This may either be due to a lack of sufficient training molecules with particular characteristics, or it may be due to a fundamental deficiency in the information used to build the models. For example, for the WAAC/SVM models, three outliers can be detected whose residuals are more than four standard deviations from the mean (Figures 3 and 4). A polyfluorinated amide, mol41, is predicted to have a melting point of 233°C although its experimental melting point is 44°C. The melting points of the other two outliers were both underestimated: mol4161, m.p. 314.5°C but predicted 119°C, and mol4195, m.p. 342°C, but predicted 111°C. Both of these molecules have extended conjugated structures, causing the molecule to be planar over a wide area, and which are likely to give rise to extensive π-π stacking in the solid state. As a result, they are conformationally less flexible than might be expected from the number of rotatable bonds. mol4161 is also an outlier to the other three models; for WAAC/PLS it is the only outlier, whereas the RF and kNN predictions have a second outlier, mol4208 (Figure 4).

The WAAC algorithm described here is particularly useful when a machine learning method is prone to overfitting if presented with a large number of descriptors, such as is the case with PLS. However, not all machine learning methods require a prior feature selection procedure. The Random Forest (RF) method of Breiman uses consensus prediction of multiple decision trees built with subsets of the data and descriptors to avoid overfitting. For comparison with the WAAC results, we predicted the melting point values for the external test data using an RF model built on the training data. We also compared to a 15 Nearest Neighbour model (kNN) where the predictions of the set of neighbours were combined using an exponential weighting. In our comparison, the RMSE(ext) and R²(ext) show that the RF and WAAC/SVM models are very similar, and are better than the WAAC/PLS and kNN models. However, analysis of the residuals shows that the RF is more prone to bias at high and low values of the melting point compared to the other models.

A predictive bias was observed for all models at the extremes of the range of melting points. A similar effect was observed by Nigsch et al. for a kNN model of melting point prediction [33]. The effect was attributed to the fact that the density of points in the training set is less at the extremes of the range of melting point values. This means that the nearest neighbours to a point near the extreme are more likely to have melting points closer to the mean. This effect is most pronounced for the RF model, and the explanation may be similar.

In this study the WAAC algorithm was guided using the RMSE of prediction for an internal test set, RMSE(int). The choice of which objective function to use should be considered carefully. If an objective function is chosen which does not explicitly penalise the number of descriptors but only does so implicitly (for example, RMSE(int)), irrelevant descriptors may accumulate in the converged model. When using such an objective function, the winnowing procedure implemented in WAAC plays an important role in removing these descriptors after the optimisation phase by initiating a new search of a reduced feature space which makes it less likely that irrelevant descriptors will be selected. This effect is shown in Figure 2(b) and 2(d),
where poorer models were found when the WAAC feature selection and parameter optimisation procedure was applied without winnowing.

An alternative type of objective function is one that explicitly penalises the number of descriptors. Such functions typically contain a cost term which is adjusted based on some a priori knowledge of the number of descriptors desired in the model. For example, the modified ACO algorithm of Shen et al. [26] was guided by a fitness function with two terms, one relating to the number of descriptors and the other to the fit of the model to the training set. Objective functions such as this quickly force models into a reduced feature space by favouring models with fewer descriptors. However, the moving probabilities used to choose descriptors will be misleading as they will largely be based on those descriptors present in models with fewer descriptors rather than those with the best predictive ability. As a result, descriptors with good predictive ability may be removed by chance. It should be noted that an objective function that simply optimises a measure of fit to the training data is not a suitable choice for the development of a model with predictive ability. Optimising the RMSE on the entire training data, RMSE(tr), or optimising the R²(tr) value, will produce an overfitted model that fits the training data exceptionally well but performs poorly on unseen data.

Near the end of each optimisation phase, the majority of ants converge to the same feature selection and parameter values, causing the same model to be repeatedly evaluated. It should be possible to gain a significant speedup if instead of re-evaluating a model, a cached value were used. Caching could be simply done by storing the objective function and models for all of the ants from the last few iterations. This is especially important if an objective function is used whose value varies on re-evaluation as is the case, for example, with the RMSE from n-fold cross-validation, RMSE(cv). Since for each ant the best score is retained, the value of the objective function will tend towards the optimistic tail of the distribution of values of the RMSE(cv). However, it should not have a major effect on the results of the feature selection and parameter optimisation, as model re-evaluation generally occurs only once the majority of the ants' models have already converged.

Conclusion
The key elements to developing an effective QSPR model for prediction are accurate data, relevant descriptors and an appropriate model. Where there is no a priori information available on relevant descriptors, some form of feature selection needs to be performed.

We have presented WAAC, an extension of the modified ACO algorithm of Shen et al. [26], which can perform simultaneous optimisation of feature selection and model parameters. In addition, the moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and our winnowing procedure promotes the removal of irrelevant descriptors.

We have shown that the WAAC algorithm can be used to simultaneously optimise parameter values and the selected features for PLS and SVM models for melting point prediction. In particular, the resulting SVM model based on 28 descriptors performed as well as a Random Forest model that used the entire set of 203 descriptors.

Authors' contributions
NMOB conceived and developed the WAAC algorithm, applied it to the melting point dataset, analysed the results and drafted the manuscript. DSP was involved in the interpretation of the results, revising the manuscript and carried out the Random Forest calculations. FN implemented the kNN model. JBOM contributed to the analysis of data and revising the manuscript. All authors read and approved the final manuscript.

Additional material
Additional file 1
The external test set. The models were evaluated by testing on this external test set.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S1.csv]

Additional file 2
The internal training set. The WAAC feature selection algorithm was trained on this.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S2.csv]

Additional file 3
The internal test set. The objective function used to guide the WAAC feature selection algorithm was calculated using this internal test set.
[http://www.biomedcentral.com/content/supplementary/1752-153X-2-21-S3.csv]

Acknowledgements
We thank the BBSRC (NMOB and JBOM – grant BB/C51320X/1), Pfizer (DSP and JBOM – through the Pfizer Institute for Pharmaceutical Materials Science), and Unilever for funding FN and JBOM and for supporting the Centre for Molecular Science Informatics. NMOB thanks Dr. Jen Ryder, Dr. Daniel Almonacid and Dr. Avril Coghlan for helpful comments on the manuscript.
References
1. Hansch C, Maloney PP, Fujita T, Muir RM: Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 1962, 194:178-180.
2. Hawkins DM: The problem of overfitting. J Chem Inf Comput Sci 2004, 44:1-12.
3. Guyon I, Elisseeff A: An introduction to variable and feature selection. J Mach Learn Res 2003, 3:1157-1182.
4. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43:493-500.
5. MOE (Molecular Operating Environment), v2004.03 [http://www.chemcomp.com]. Chemical Computing Group Inc., Montreal, Quebec, Canada
6. SYBYL 7.1 2006 [http://www.tripos.com]. Tripos Inc., 1699 Hanley Road, St. Louis, MO 63144
7. John GH, Kohavi R, Pfleger K: Irrelevant features and the subset selection problem. In Machine learning, Proceedings of the Eleventh International Conference: 10–13 July 1994; Amherst. Edited by: Cohen WW, Hirsh H. Morgan Kaufmann; 1994:121-129.
8. Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell 1997, 97:273-324.
9. Dudek AZ, Arodz T, Gálvez J: Computational methods in developing quantitative structure-activity relationships (QSAR): a review. Comb Chem High Through Screen 2006, 9:213-228.
10. Liu Y: A comparative study on feature selection methods for drug discovery. J Chem Inf Comput Sci 2004, 44:1823-1828.
11. Whitney AW: A direct method of nonparametric measurement selection. IEEE Trans Comput 1971, 20:1100-1103.
12. Marill T, Green DM: On the effectiveness of receptors in recognition systems. IEEE Trans Inform Theory 1963, 9:11-17.
13. Pudil P, Novovičová J, Kittler J: Floating search methods in feature selection. Patt Recog Lett 1994, 15:1119-1125.
14. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn 2002, 46:389-422.
15. Fröhlich H, Wegner JK, Zell A: Towards optimal descriptor subset selection with support vector machines in classification and regression. QSAR Comb Sci 2004, 23:311-318.
16. Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Boston: Kluwer Academic Publishers; 1989.
17. Rogers D, Hopfinger AJ: Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci 1994, 34:854-866.
18. Wegner JK, Zell A: Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J Chem Inf Comput Sci 2003, 43:1077-1084.
19. von Homeyer A: Evolutionary Algorithms and their Applications in Chemistry. In Handbook of Chemoinformatics Volume 3. Edited by: Gasteiger J. Weinheim: Wiley-VCH; 2003:1239-1280.
20. Agrafiotis DK, Cedeno W: Feature selection for structure-activity correlation using binary particle swarms. J Med Chem 2002, 45:1098-1107.
21. Lin WQ, Jiang JH, Shen Q, Shen GL, Yu RQ: Optimized block-wise variable combination by particle swarm optimization for partial least squares modeling in quantitative structure-activity relationship studies. J Chem Inf Model 2005, 45:486-493.
22. Guha R, Jurs PC: Development of linear, ensemble and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci 2004, 44:2179-2189.
23. Vapnik VN: The nature of statistical learning theory. New York: Springer Verlag; 1995.
24. Hastie T, Tibshirani R, Friedman J: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
25. Smola AJ, Schölkopf B: A tutorial on support vector regression. Stat Comput 2004, 14:199-222.
26. Shen Q, Jiang JH, Tao JC, Shen GL, Yu RQ: Modified Ant Colony Optimization Algorithm for Variable Selection in QSAR Modeling: QSAR Studies of Cyclooxygenase Inhibitors. J Chem Inf Model 2005, 45:1024-1029.
27. Goss S, Aron S, Deneubourg JL, Pasteels JM: Self-organized shortcuts in the Argentine ant. Naturwissenschaften 1989, 76:579-581.
28. Dorigo M, Di Caro G, Gambardella LM: Ant algorithms for discrete optimization. Artif Life 1999, 5:137-172.
29. Izrailev S, Agrafiotis DK: Variable selection for QSAR by artificial ant colony systems. SAR QSAR Environ Res 2002, 13:417-423.
30. Gunturi SB, Narayanan R, Khandelwal A: In silico ADME modelling 2: Computational models to predict human serum albumin binding affinity using ant colony systems. Bioinorg Med Chem 2006, 14:4118-4129.
31. R: A Language and Environment for Statistical Computing 2006 [http://www.R-project.org]. R Foundation for Statistical Computing, Vienna, Austria
32. Karthikeyan M, Glen RC, Bender A: General melting point prediction based on a diverse compound data set and artificial neural networks. J Chem Inf Model 2005, 45:581-590.
33. Nigsch F, Bender A, van Buuren B, Tissen J, Nigsch E, Mitchell JBO: Melting point prediction employing k-nearest neighbour algorithms and genetic parameter optimization. J Chem Inf Model 2006, 46:2412-2422.
34. Breiman L: Random Forests. Mach Learn 2001, 45:5-32.
35. Hasegawa K, Miyashita Y, Funatsu K: GA Strategy for Variable Selection in QSAR Studies: GA-Based PLS Analysis of Calcium Channel Antagonists. J Chem Inf Comput Sci 1997, 37:306-310.
36. Selwood DL, Livingstone DJ, Comley JCW, O'Dowd AB, Hudson AT, Jackson P, Jandu KS, Rose VS, Stables JN: Structure-activity relationships of antifilarial antimycin analogues: a multivariate pattern recognition study. J Med Chem 1990, 33:136-142.
37. Balaban AT: Highly discriminating distance-based topological index. Chem Phys Lett 1982, 89:399-404.
BMC Bioinformatics
BioMed Central
Software                                                                                                                               Open Access

Userscripts for the Life Sciences
Egon L Willighagen*1, Noel M O'Boyle2, Harini Gopalakrishnan3, Dazhi Jiao3, Rajarshi Guha3, Christoph Steinbeck4 and David J Wild3

Address: 1Cologne University Bioinformatics Center, Cologne University, Cologne, Germany, 2Cambridge Crystallographic Data Centre, Cambridge, UK, 3School of Informatics, Indiana University, Bloomington, USA and 4Wilhelm-Schickard-Institut, Center for Bioinformatics, University of Tübingen, Tübingen, Germany

Email: Egon L Willighagen* - egonw@users.sf.net; Noel M O'Boyle - baoilleach@gmail.com; Harini Gopalakrishnan - hgopalak@indiana.edu; Dazhi Jiao - djiao@indiana.edu; Rajarshi Guha - rguha@indiana.edu; Christoph Steinbeck - c.steinbeck@steinbeck-molecular.de; David J Wild - djwild@indiana.edu
* Corresponding author

Published: 21 December 2007
Received: 31 August 2007
Accepted: 21 December 2007
BMC Bioinformatics 2007, 8:487 doi:10.1186/1471-2105-8-487
This article is available from: http://www.biomedcentral.com/1471-2105/8/487
© 2007 Willighagen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources.

Results: Several userscripts are presented that enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. The scripts make use of Greasemonkey-like plugins for web browsers and are written in JavaScript. Information from third-party resources is extracted using open Application Programming Interfaces, while common Universal Resource Locator schemes are used to make deep links to related information in that external resource. The userscripts presented here use a variety of techniques and resources, and show the potential of such scripts.

Conclusion: This paper discusses a number of userscripts that aggregate information from two or more web resources. Examples are shown that enrich web pages with information from other resources, and show how information from web pages can be used to link to, search, and process information in other resources. Due to the nature of userscripts, scientists are able to select those scripts they find useful on a daily basis, as the scripts run directly in their own web browser rather than on the web server. This flexibility allows the scientists to tune the features of web resources to optimise their productivity.
Background
The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs, and regular HTML pages are available containing information relevant to chemists and biologists [1-4]. While each of those resources is valuable in itself, integrating information from these resources increases the value even more: for example, PubChem provides a wealth of data but could be complemented with 3D models to create an even richer information source.

The original goal of the world wide web was to hyperlink individual web pages allowing humans to explore a web of knowledge. For individual web pages these links can be created manually, as is still done in blogs, wikis, and static HTML pages; for large databases this is, however, not feasible. Userscripts are small programs that can alter the HTML content rendered by web browsers. For example, a userscript may add book prices from competitors to the Amazon.com website, or may remove unwanted advertisements from a site. Using the same approach, userscripts can also solve the problem of interlinking web resources, by adding to web pages of one resource dynamically generated hyperlinks into another. By selecting a specific set of userscripts, the user can tune a website to provide all kinds of facilities not anticipated by the original author of the site. For example, userscripts have been used in bioinformatics to enhance the iHOP web page [5]: the script extracts user assigned tags from a third party resource, and shows them as a tag cloud on iHOP pages for particular genes.

Automatic hyperlinking is only possible through the use of unique identifiers such as the PDB ID, the CAS registration number and, more recently, the IUPAC International Chemical Identifier (InChI). While identifiers are easily used to connect databases, such as done in the SRS system [6] or in meta database software like BioWarehouse [7], the sheer number of web resources makes it impossible to integrate all resources. Consequently, (bio)chemical search engines, such as ChemSpider [8] and tools to harvest information from web resources, such as ChemXtreme [9] and BioSpider [10,11], as well as systems that standardize algorithmic access to resources and services, such as BioMOBY [12], have emerged.

Another reason why identifiers do not always allow linking resources is that many of them are database specific, such as the PDB ID and the Digital Object Identifier (DOI), and sometimes even restricted in being used, as with the CAS registration number. Open standard identifiers address this problem. Such identifiers can be derived from ontologies, dictionaries, encyclopedias, or computed by an algorithm. The Gene Ontology terms are often used as identifiers [13], indicating that a specific database entry is related to the cited term in the ontology, and therefore related to entries from other databases annotated with that term.

Identifiers that can be calculated algorithmically are even better, because they do not need to be looked up in a list of identifiers. Instead, anyone can calculate them from the object itself. For example, for molecular structures the InChI [14] is the ideal replacement for database specific identifiers such as the CAS registration number, the PubChem compound identifier and the ChEBI identifier. These all require a look up or conversion table to convert one identifier into another. Using the InChI, one can look up information in all databases without having to know the database specific identifier.

In addition to the unique identifier, one additional functionality is needed to create a link to a particular database: the database must provide either an API (Application Programming Interface) which can be queried using the identifier or else provide a uniform scheme for deep linking to a web page containing information about the entry behind the identifier. For example, looking up structures in PubChem is done with a scheme in which the InChI is embedded verbatim. To look up the structure of methane (InChI=1/CH4/h1H4), the URL http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pccompound&term=%22InChI=1/CH4/h1H4%22[InChI] is used.

The plethora of resources is overwhelming, and both users and database developers may have preferred subsets, e.g. more trusted, resources. It is therefore worthwhile to have a system that allows users to choose which resources they want to have linked with which other resources. Userscripts provide the necessary technology to allow this within web browsers. Here we describe several userscripts we have developed to create links between web resources of interest to researchers in the life sciences.

Implementation
We use the following techniques to link various web resources in this paper: userscripts, unique identifiers, microformats, and web resource interfaces. The following sections describe how these are used in this work.

Userscripts
A userscript is a small program written in JavaScript that is automatically run within a web browser (often by a plugin or add-on) when the user accesses pages that match a particular URL. Userscripts allow the user to modify the HTML content of a web page on-the-fly, by adding or removing elements or by moving them around. For example, userscripts exist that remove pop-up advertisements
from web pages, and that alter the Amazon.com web page to provide book prices from alternative suppliers. A repository of userscripts exists at userscripts.org. Chemists and biologists can find relevant userscripts by searching with the terms chemistry or biology.

Of the popular web browsers, only Opera provides built-in support for userscripts (referred to as 'User JavaScript'). To enable userscript support in other browsers, a third-party extension needs to be installed: Greasemonkey [15] for Firefox, Creammonkey [16] for Safari, IE7pro [17] or Turnabout [18] for Internet Explorer. The userscripts presented in the Results section are targeted at Greasemonkey, although it should be possible to run them in any browser with only minor changes.

The web browser user has full control over which userscripts she wants to have installed, allowing her to customise web pages exactly the way she wishes. Once installed, it is possible to individually enable or disable installed scripts. For example, for Greasemonkey see the Manage User Scripts option in the Tools menu under Greasemonkey, or to disable the extension completely, click on the Greasemonkey icon in the status bar. Further control is provided by specifying to which web pages the script applies. Userscripts define default rules (e.g. http://www.biomedcentral.com/), but the user is normally able to override these.

The userscript has two main methods to find the HTML content to which to add or remove elements. The most accurate one is to analyse the document object model (DOM). This approach is used by the Sechemtic userscript to find uses of chemical microformats (see example below under Microformats). The other method is to use regular expressions to find certain strings in the text of the web page. This works particularly well for identifiers with a unique and well described syntax. For example, a regular expression for InChIs will have fewer false positives than one for PDB identifiers.

As with any program that you run on your computer, it is important to consider security when installing userscripts. Although the security model used by Greasemonkey prevents attacks by malicious websites, it is unable to detect or prevent the user himself installing a malicious userscript. Such scripts do exist; recently, malicious userscripts were uploaded to Userscripts.org that attempted to steal information from users' cookies. In that case, once the problem was discovered the malicious userscripts were easily detected and removed by the administrator. We recommend that unless you are familiar with JavaScript and carefully inspect the source code, you should only install userscripts from a trusted source.

Unique identifiers
Recognition of biological and chemistry relevant information on web pages is simplified by using identifiers [19]. Such identifiers may or may not be marked up with semantic markup such as microformats (see below). Identifiers are widely used to make connections between databases, and often identify a specific entry in a database. Some examples of this are the PDB identifier, Digital Object Identifiers, PubChem compound identifier, and the CAS registry number for, respectively the PDB, DOI, PubChem, and the Chemistry Abstract Service databases. In this study we use DOIs, InChIs, and PDB identifiers as our unique identifiers (Table 1).

Table 1: Userscripts for the life sciences. A summary of the resources and identifiers used by userscripts for the life sciences. The Identification method indicates how the userscript recognises relevant information on a web page. The Identifiers column describes the unique identifier searched for. The Resources column indicates the web resource to which a link is created, or from which data is extracted.

                         Technologies and Resources used
Name                     Identification method        Identifiers                Resources
Jmol4PubChem             HTML tags on PubChem         PubChem ID                 Pub3D [36]
OSCAR3 on HTML           natural language processing  chemical structure name    -
PDB-Jmol                 regular expression           PDB ID                     First Glance in Jmol [41]
Sechemtic                microformats                 InChI, SMILES, CAS number  PubChem [32], eMolecules [43], Google [54]
Add quotes to DOIs       regular expression           DOI                        Postgenomic [3], Chemical blogspace [4]
Add quotes to molecules  microformats                 InChI                      Chemical blogspace [4]
Add to Connotea          regular expression           DOI                        Connotea [30]
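The regular-expression method of identification described above, and the remark that an InChI pattern gives fewer false positives than a PDB pattern, can be illustrated with a minimal sketch. The three patterns and the helper name findIdentifiers are assumptions made for this illustration; they are not the expressions used by the published userscripts, and real identifiers may need more careful patterns.

```javascript
// Illustrative patterns only (assumptions, not taken from the published userscripts).
// InChI: the literal prefix "InChI=1" (optionally "1S"), a slash, then non-space text.
const INCHI_PATTERN = /InChI=1S?\/[^\s"<>]+/g;
// DOI: "10.", a 4-9 digit registrant code, a slash, then non-space text.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/g;
// PDB ID: a digit 1-9 followed by three alphanumeric characters.
const PDB_PATTERN = /\b[1-9][A-Za-z0-9]{3}\b/g;

// Return every candidate identifier found in a piece of page text.
function findIdentifiers(text) {
  return {
    inchis: text.match(INCHI_PATTERN) || [],
    dois: text.match(DOI_PATTERN) || [],
    pdbIds: text.match(PDB_PATTERN) || [],
  };
}

const sample =
  "Methane has InChI=1/CH4/h1H4 and is discussed in doi:10.1186/1471-2105-8-487 " +
  "together with PDB entry 1CRN, published in 2007.";
const found = findIdentifiers(sample);
console.log(found);
```

In this sample the InChI and DOI patterns each return exactly one hit, whereas the short PDB pattern returns the real entry 1CRN but also the year 2007 and digit groups from inside the DOI. This is the kind of false positive the text warns about for identifiers without a distinctive syntax.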
Microformats
Microformats [20] are a lightweight specification that extends HTML to add semantic markup to web pages. For example, hCard is a microformat that allows semantic mark up of address information [21], and hCalendar is a microformat specification for the representation of calendar information about events [22].

A microformat specification has also been suggested for chemistry that would make it much easier to recognise compound names, InChIs, SMILES and CAS registry numbers. Userscripts, or indeed any other programs, would then no longer need to depend on regular expressions to find names and identifiers, but could use this markup to accurately extract the identifier.

For example, a web page implementing the InChI microformat would wrap any InChIs in a HTML <span> element with a @class attribute as follows: <span class="inchi">InChI=1/</span>. This information can easily be extracted using the document.evaluate method which takes an XPath [23] expression (//span[@class="inchi"] in this case):

    allInChIs = document.evaluate(
      '//span[@class="inchi"]', document, null,
      XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
      null
    );

This code returns all HTML nodes that mark up InChI strings using the InChI microformat. By iterating over these nodes, the userscript can insert new HTML elements, such as links to external resources as shown here in code taken from the Sechemtic userscript:

    for (var i = 0; i < allInChIs.snapshotLength; i++) {
      spanElement = allInChIs.snapshotItem(i);
      inchi = spanElement.innerHTML;
      // create a link to PubChem
      newElement = document.createElement('a');
      newElement.href = "http://www.ncbi.nlm." +
        "nih.gov/entrez/query.fcgi?CMD=search" +
        "&DB=pccompound&term=%22" + inchi +
        "%22[InChI]";
      newElement.innerHTML = "<sup>PubChem</sup>";
      // place the link directly after the marked-up InChI
      spanElement.parentNode.insertBefore(
        newElement, spanElement.nextSibling
      );
    }

Web resource interfaces
Web databases are the primary source of information used by the discussed userscripts. While it is easy to have scripts create links to external web resources, it is also possible for them to retrieve information from those resources and include it in the HTML content of the web page the user is browsing. The latter is, for example, performed by the userscript that adds comments from Postgenomic.com and Chemical blogspace to journal web pages.

The general approach userscripts use to retrieve information from external web resources uses HTTP just like any web browser itself. To simplify the process, userscripts tend to use a combination of XMLHttpRequest, possibly via the Greasemonkey GM_xmlhttpRequest wrapper method, and the JavaScript Object Notation (JSON) format [24] for data representation. The XMLHttpRequest method retrieves the information using a URL that normally points to a data interface, or API. The Postgenomic.com software has such an API that returns the blog posts that discuss a particular article, as identified by its DOI. Chemical blogspace uses the same API, and adds another one to return blog posts that discuss a particular molecule, as identified by its InChI. Both database APIs can return the information as JSON objects, which is how they are used in the discussed userscripts.

Since our userscripts rely on a particular API or specially-constructed URL to access an external resource, they will fail if the external resource changes its API or the URL it provides to access it. This will not affect the browsing experience of the user, but the additional functionality provided by the userscript will no longer be available. To deal with this, each of the userscripts described in this article checks once a day for a new version and prompts the user to install it if one is available. This means that when a userscript is updated to deal with a new API or URL, every user will quickly have access to the latest version.
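The retrieval pattern described in this section (build a URL from an identifier, fetch it over HTTP, and parse the JSON reply) might be sketched roughly as follows. The endpoint https://api.example.org/posts, its query parameters, and the blog and title fields of the response are hypothetical placeholders, since the text does not specify the actual Postgenomic or Chemical blogspace API. The transport is passed in as a function so that the sketch runs without a browser or network access.

```javascript
// Hypothetical endpoint and response shape, for illustration only; the real
// Postgenomic / Chemical blogspace API is not specified in the text.
const API_BASE = "https://api.example.org/posts";

// Build the query URL for a DOI, percent-encoding it for use in a query string.
function buildQueryUrl(doi) {
  return API_BASE + "?format=json&doi=" + encodeURIComponent(doi);
}

// Turn the JSON text returned by the service into short display strings.
function summarisePosts(jsonText) {
  const posts = JSON.parse(jsonText);
  return posts.map((post) => post.blog + ": " + post.title);
}

// Fetch and summarise posts. `httpGet(url, onLoad)` is injected so the same
// logic can sit on top of XMLHttpRequest, GM_xmlhttpRequest, or a test stub.
function fetchPostSummaries(doi, httpGet, onDone) {
  httpGet(buildQueryUrl(doi), (responseText) => onDone(summarisePosts(responseText)));
}

// Stub transport standing in for the network call; it answers synchronously.
function stubHttpGet(url, onLoad) {
  onLoad(JSON.stringify([{ blog: "Example blog", title: "A post about " + url }]));
}

let summaries = [];
fetchPostSummaries("10.1186/1471-2105-8-487", stubHttpGet, (result) => {
  summaries = result;
});
console.log(summaries);
```

Within Greasemonkey, the stub would presumably be replaced by a small wrapper around GM_xmlhttpRequest that issues a GET request and passes the response text to the callback. Keeping URL construction and parsing in separate functions also confines the effect of an API change, of the kind discussed above, to one place in the script.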
Results
This paper introduces userscripts that have been written in our research groups as exemplars of how web resources can be integrated and to outline how they can be used in research. Our userscripts can be classified into two broad areas: those that link chemical and biological data to websites, and those that affect how we interact with the scientific literature.

In the following sections, we describe in detail how functionality is added to the web page being browsed. Table 1 summarises the resources linked to, or accessed, by each script, as well as the unique identifier used.

Interacting with the scientific literature
OSCAR3 running on HTML
Published journal articles and other web documents with chemistry content are not normally marked up by the publishers or authors to provide machine readable representations of chemical structures and related information. As a result, there has been active interest in methods which can mark documents up automatically. In particular, OSCAR3 [25,26], developed at the Unilever Centre for Molecular Informatics at the University of Cambridge, and used by the Royal Society of Chemistry in their Project Prospect [27], searches documents for chemical names, spectra, and other chemical information, and automatically marks up the content using XML tags (to the extent of generating, where possible, machine readable SMILES and InChI structures for chemicals referenced in the document).

We have created a userscript, ChemGM.user.js, that will automatically run OSCAR on a web page and provide inline hypertext links to PubChem for chemical structure names that are found in the page (including 2D structure depictions generated by another web service and PubChem searches). The userscript can be run on any web page, but it is particularly applicable to online journal articles and chemistry blogs. An example highlighting the effect of this userscript is shown in Figure 1.

Figure 1: Highlighting and annotating chemical terms in an online journal article. Screenshots showing the effect of the ChemGM.user.js userscript on the Chemistry Central Journal web page (full URL: [47]) for Majumder et al. [48]. (a) When the userscript is running a toolbar is added to the top of every webpage. Clicking the highlight button in the toolbar causes the contents of the webpage to be analysed for chemical terms. (b) shows the original text of the abstract. (c) After a minute or so, any chemical terms recognised are highlighted in yellow, and are annotated with hypertext links to their entries in PubChem (if available) and a 2D depiction of the image.

Note that though the images use an article from Chemistry Central
Journal, the script can be applied to any web page, irrespective of its source or content.

Add quotes from Chemical blogspace and Postgenomic to DOIs
It can be a challenge to keep up with the primary literature in a field. At the same time, there are a large number of scientific blogs, many of which have reviews of the recent literature or highlight interesting papers. The Postgenomic web site was developed by Euan Adie and later hosted by Nature Publishing Group and currently aggregates information from over 750 scientific blogs [3]. The source code is open and has been used by one of the authors (ELW) to establish a similar site, Chemical blogspace, for over 140 blogs with chemical content [4]. Both of these sites identify references to journal articles in blogs, and make this information available through an API. Compared to the Postgenomic website, the Chemical blogspace site also identifies molecules referenced in blogs either by microformat markup of InChI and SMILES, or by analysing links to Wikipedia [28]. If the latter link points to a wiki page that contains a PubChem compound identifier or an InChI, then the molecular structure is linked to the blog post.

This userscript uses the aggregated information collected by Postgenomic and Chemical blogspace. It runs whenever the user accesses the website of a journal publisher. It identifies any DOIs on the page, and uses the Chemical blogspace and Postgenomic APIs to find out whether those DOIs have been referenced in a blog post. If so, an icon is added to the web page next to the DOI which, if hovered over with the mouse, causes a popup to appear containing the name of the citing blog post, the blog name, and the first few lines of text of the blog. The full content can be accessed by clicking on the title of the blog post. In this way, content from blog articles widely dispersed in terms of the web is brought directly to where it is likely to be of most interest: the journal web site. Figure 2 shows the effect of this userscript when running on the HTML version of Spjuth et al. [29].

Figure 2. Adding information to DOIs on journal web pages. Screenshots from the BMC Bioinformatics web page (full URL: [49]) for Spjuth et al. [29] (a) without any userscript enabled, and (b) showing the effect of the two userscripts Add quotes from Chemical blogspace and Postgenomic to DOIs and Add to Connotea. The latter added a Connotea logo (a 'c' surrounded by linking arrows), which links to the Connotea dialog box for adding this paper to your library, and a number indicating how many people have already bookmarked this paper, which links to the existing entry for this paper on Connotea. The Add quotes userscript added the Cb logo, which links to the Chemical blogspace page for this paper, and a Pg logo, linking to the Postgenomic page. The popup titled Powered by Postgenomic.com (only partially shown) appears when the mouse is placed on the Pg logo, and contains quotes from and links to the citing blog articles.

Providing reviews of journal articles is only one of the uses of such a userscript. It is also a general way to create a link between the content of a blog post and a particular paper. In this way, bloggers can use blog posts to enhance the original journal website without any intervention required by the publisher. For example, the author of a paper may write a blog post which provides additional supporting information for a journal article or includes the article preprint for those who do not have a subscription. Alternatively, the author of a paper may write a blog post and include the DOIs of all of the references. This
would not only promote his/her own paper (all of the cited papers would show a blog comment pointing to the citing paper), but would result in an eventual network of citations which could be used to measure the impact of a paper.

Add to Connotea
Connotea is a social bookmarking site developed by Nature Publishing Group for scientists [30]. It allows a user to bookmark websites using either the DOI or a URL, and to tag those bookmarks. Crucially, it also provides an API for retrieving information.

The Add to Connotea userscript has two aspects. Firstly, it makes it easy to add papers to Connotea from journal webpages, by adding a hyperlink in the form of the Connotea logo next to every DOI identified on a journal web page. Clicking on the logo brings the user to the Connotea page for adding new papers. This aspect of the userscript is not entirely novel. A userscript has previously been developed which allows the user to add papers to Connotea from NCBI PubMed [31]. In addition, a small number of publishers (including BioMed Central and Nature Publishing Group) provide a facility to add papers to Connotea directly from their website. Our userscript differs in that it will work on the website of any journal publisher where the text contains DOIs.

The second aspect of this userscript is more interesting in the context of this paper. The userscript queries the Connotea API to find out how many people have previously added this paper to their Connotea account. It then adds this number next to the Connotea icon. Clicking on the number brings you to the Connotea page for that paper. From here it is possible to access comments on the paper. More useful perhaps, is the ability to find related papers by looking at the other papers a particular Connotea user has tagged with the same tag. Figure 2 shows the effect of this userscript when running on the HTML version of Spjuth et al. [29].

This aspect of the userscript has the potential to affect the way we read the literature. The number of times a particular paper has been bookmarked on Connotea can be considered a measure of its importance or its interest. In the past, measures such as the number of citations have served this purpose, but this information is generally not shown on journal web pages as it is not freely available. Another effect of this userscript is to link the paper the user is viewing to related papers through the Connotea website. If a researcher finds that a particular paper has been bookmarked on Connotea and is of interest to him or her, he or she is likely to find other relevant papers by browsing through the other papers bookmarked by the same Connotea user with the same tag.

On a technical note, this userscript illustrates some techniques necessary for accessing an API that requires a user name and password and that, in addition, only permits one API request every two seconds or so. Note that this userscript requires the user to have a Connotea account (which is freely available at Ref. [30]).

Linking to chemical and biological data sources
Enhancement of PubChem with 3D structures
The PubChem repository is a public collection of over 10 million compounds [32]. The database contains 2D structures as well as a number of precomputed properties (such as number of heavy atoms and topological polar surface area [33]). The web interface to this database allows a wide variety of queries. The results are usually represented in the form of a summary web page containing images of the 2D structures of all the compounds satisfying the query with links to pages for individual compounds which provide a summary of the properties of the compound. In many cases it would be useful to be able to view an image of the 3D structure of a molecule. However, PubChem currently does not contain 3D structures for the compounds stored in the database.

To address this problem, we developed a database of 3D structures of PubChem compounds as part of our web service infrastructure for chemoinformatics [34]. The structures were generated using a two-step process in which the SMILES were converted to a set of rough 3D coordinates using stochastic proximity embedding [35] and subsequently geometry optimised using the MMFF94 force field, using in-house code. A number of compounds were excluded from the final 3D database since the force field did not contain parameters for certain atom types. However the 3D database, known as Pub3D [36], contains approximately 99% of the compounds in PubChem. Pub3D is wrapped by a set of web services which encapsulate common queries including finding a structure by compound ID (CID) or finding structures matching a SMARTS pattern.

Using this web service interface we created a userscript called 3DStructureView.user.js that allows 3D structures from our database to be shown when users visit the PubChem website (see Figure 3). The script is designed to work only on the summary and detail pages that a user views after a PubChem search. It parses the page and identifies the compound ID which is then used to construct a call to the Pub3D database. The return value is a string containing the 3D structure of the compound, in SD format, which is used to construct an appropriate URL. The result of this process is that the user can now click on a link titled 3DView(Jmol), which will cause a Jmol applet [37-39] window to appear showing the 3D structure of the compound in question.
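The request limit mentioned for the Connotea API (roughly one request every two seconds) can be respected with a small throttle that waits before each call. The following is only a minimal sketch in Python rather than the JavaScript of the userscripts themselves; the class `Throttle`, the helper `fetch_all` and the `fetch` callback are hypothetical names introduced for illustration, and no real Connotea call is made.

```python
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds (assumed: 2 s)."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the logic can be tested
        self._sleep = sleep
        self._last_call = None       # time of the previous request, if any

    def wait(self):
        """Block until a new request is allowed, then record its time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_all(dois, fetch, throttle):
    """Call the hypothetical `fetch(doi)` once per DOI, never faster than allowed."""
    results = {}
    for doi in dois:
        throttle.wait()
        results[doi] = fetch(doi)
    return results
```

Queuing the DOIs found on a page and draining the queue through one shared throttle keeps the total request rate bounded no matter how many DOIs the page contains.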
Figure 3. Adding 3D models to PubChem. Screenshot of the PubChem web page for aspirin (full URL: [50]) with the 3DStructureView userscript enabled. The userscript added the first line of text in the compound summary information. Clicking on the 3DView(Jmol) link causes a window to pop up showing a 3D model of the structure. Clicking on the SDF Format link allows the user to download the calculated 3D structure of the molecule in SDF file format.

As an example, after installing the script, one can navigate to the PubChem website [32] and search for entries related to aspirin. This should return slightly more than thirty hits. If one then clicks on the compound ID for the first hit, one is taken to a summary page which provides various details regarding the molecular structure and biological activity of aspirin. In addition to the data provided by PubChem, the userscript has enhanced the page to add two links: 3DView(Jmol) and SDF Format. The former link will bring up an instance of the Jmol applet showing a 3D structure of aspirin, while the second link allows one to download the 3D structure in the SD file format (see Figure 3).

PDB-Jmol Greasemonkey Script
The Protein Data Bank [40] is a repository of experimentally-determined 3D coordinates of proteins. Each entry has a PDB ID, which is a unique four-character identification code consisting of a number followed by three characters which can be either letters or numbers; for example, 1abe, 114L and 6NN9. The PDB-Jmol Greasemonkey userscript identifies all PDB IDs on web pages and adds hyperlinks
to the FirstGlance in Jmol web page [41] for that protein. This website uses the Open Source molecular viewer Jmol to show the protein as a 3D model which can be manipulated by the user. In this way, the user can instantly view the 3D structure of any PDB ID mentioned on a website, and in particular, if the user is reading the HTML version of a journal article on-line, all PDB IDs in the paper will similarly be enhanced. Figure 4 shows an example of the latter case where PDB identifiers in the online version of Mardia et al. [42] have been identified and links added.

Figure 4. The effect of the PDB-Jmol userscript. Screenshots from the BMC Bioinformatics web page (full URL: [51]) for Mardia et al. [42] showing a paragraph containing PDB identifiers (a) without the PDB-Jmol userscript installed, and (b) with the PDB-Jmol userscript installed. The Jmol text in yellow is a hyperlink to the FirstGlance in Jmol page [41] for a particular protein structure.

As this userscript runs on all web pages accessed by the user, and since the search term is simply 4 characters long, additional constraints are necessary to prevent excessive false positive identification. The userscript only looks for PDB IDs if it finds one of the following terms in the web page: protein, PDB, or enzyme.

Sechemtic
Sechemtic is a small userscript that detects use of microformats (see Implementation) to mark up molecular identifiers, as well as regular molecular names. It recognises markup for the IUPAC InChI and SMILES, and creates links for those molecules to web resources like eMolecules [43], PubChem [32] and a link to Google to search for more information (see Figure 5). It should be noted that, in particular with Google searches, the links based on InChIs are more useful as the same molecule may be represented by several different SMILES strings but only a single InChI.

From a technological point of view, these scripts are very simple in nature; the semantic nature of the (chemical) microformats is what makes this simple script possible. The semantic markup in HTML for InChIs that is picked up by the userscript looks like <span class="inchi">InChI=1/CH4/h1H4</span> while the markup for a SMILES string looks like <span class="smiles">CCO</span>.

Add quotes from Chemical blogspace to molecules
This userscript, quite similar to the one that adds comments to DOIs, runs on all web pages accessed by the user. Using the same method as the Sechemtic userscript (see above), it identifies any molecules referenced on a page which have been marked up with the appropriate tags. It also supports the (non-marked up) InChI tags on PubChem. It then uses the Chemical blogspace API to find out whether this molecule has been referenced in a blog post. The remainder is as for the previous userscript; an icon is added which contains a popup to the citing blog post. Figure 6 shows the effect of the userscript on the PubChem page for methane (InChI=1/CH4/h1H4). A full
list of molecules with comments in Chemical blogspace is available from Ref. [44].

A possible use of this script is to link all discussions of a particular drug in the blogosphere to a static page containing information on the drug. Another use is to link discussions on syntheses of molecules to pages containing references to the molecule.

Figure 5. Annotating chemical terms marked up with microformats. Screenshots showing a blog post (full URL: [52]) containing chemical terms marked up with chemical microformats, (a) without and (b) with the Sechemtic userscript enabled. The added hyperlinks allow the user to look up the structure in Google, ChemSpider and PubChem.

Discussion
Here we have focused on the development of userscripts that enhance web pages for biologists and chemists. If all of these userscripts are installed, any web page with a PDB code will now contain a link to view the structure in 3D, journal webpages will show chemical structure markup and blog comments on articles, 3D structures and links to appropriate blog posts will be available from PubChem, and any use of chemical microformats will be picked up adding links to Google, eMolecules, ChemSpider and PubChem.

These examples show that userscripts offer a powerful technology to improve the way we read the scientific literature and access (bio)chemical databases. This is done by dynamically combining web resources, and enriching the information content of the primary resources. Theoretically, such links can be made on the web server itself, and this is commonly done, but it does not give the user the flexibility to choose what features to install. The crucial point about userscripts is that they do not require the involvement of the web site provider. All of the enhancements are done on-the-fly by the user's browser.
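The filtering rule described for the PDB-Jmol userscript (search for four-character IDs only when the page mentions protein, PDB or enzyme) can be sketched as follows. This is an illustrative Python version, not the userscript's own JavaScript; the function name `find_pdb_ids` and the choice of case-insensitive matching for the context terms are assumptions.

```python
import re

# A digit followed by three letters or digits, e.g. 1abe, 114L, 6NN9.
PDB_ID = re.compile(r"\b[0-9][A-Za-z0-9]{3}\b")
# Context terms named in the text; ignoring case is an assumption.
CONTEXT = re.compile(r"protein|PDB|enzyme", re.IGNORECASE)


def find_pdb_ids(text):
    """Return candidate PDB IDs in order of appearance, or [] without context."""
    if not CONTEXT.search(text):
        return []
    seen = []
    for match in PDB_ID.finditer(text):
        if match.group() not in seen:
            seen.append(match.group())
    return seen


print(find_pdb_ids("The protein structures 1abe, 114L and 6NN9 were used."))
print(find_pdb_ids("An enzyme study from 2007 cites 1abe."))
```

The second call also reports 2007, since a year has the same shape as a PDB ID. This is the kind of clash with other page content that makes the extra keyword constraint necessary, and it shows that the constraint reduces false positives rather than eliminating them.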
Figure 6. Adding comments from the blogosphere to molecules. Screenshots from the PubChem web page for methane (full URL: [53]), (a) without and (b) with the Add quotes from Chemical blogspace to molecules userscript enabled. The InChI InChI=1/CH4/h1H4 is identified by the userscript, which then adds the Cb logo. The logo is a link to the Chemical blogspace page for this molecule. The popup titled Powered by Chemical blogspace (only partially shown) appears when the mouse is placed on the Cb logo, and contains quotes from and links to blog posts that discuss this molecule.

The userscripts combine a number of technologies for data retrieval and communication. Information from HTML pages is extracted using identifiers, regular expressions, XPath queries and microformats. It is noted that the syntax of (bio)chemical and other identifiers is generally not distinct enough to detect them with perfect recall and optimal precision. It is easiest to write regular expressions for the DOI and the InChI with a high precision, compared to, for example, the PDB ID which has a syntax which can clash with other web page content.

Microformats offer a solution for such less well-defined identifiers. This technology is used to wrap identifiers with some semantic markup so that the userscript can easily extract the identifiers using XPath queries. However, microformats do not incorporate a mechanism to provide details on what a microformat means. That is, microformats are not backed up by a specified ontology. As a result the chemical 'smiles' microformat, to mark up SMILES, may collide with a microformat specification to mark up moods.

Once the identifier is extracted by whatever means, the userscripts can either create links to other web resources, or query those resources and embed results into the HTML of the web page on which the userscript is run. While any HTTP-based approach can be used for this, the example userscripts show that combining XMLHttpRequest with JSON [24] is a rather straightforward approach.

Conclusion
We have shown that userscripts are a simple and useful way of integrating bio- and chemoinformatics web resources. In particular, they permit (a) the augmentation of existing websites with functionality not envisioned or indeed wanted by the original author, (b) the integration of information from different domains, and (c) a connection point between the social web (wikis, blogs etc.) and traditional web tools and sites. We continue to find interesting uses for userscripts, and we hope this manuscript will spur others to do likewise.

Availability and requirements
• Project name: Userscripts for Chemistry and Biology
• Project home page: Blue Obelisk [45] website [46]. Download link: http://blueobelisk.sf.net/wiki/Userscripts
• Operating system(s): Platform independent
• Programming language: JavaScript
• Other requirements: Firefox with Greasemonkey add-on (or equivalent) for userscript support; Java is required to view the Jmol applet; a Connotea account is required for the Add to Connotea userscript
• License: GNU GPL, BSD
• Any restrictions to use by non-academics: none
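To make the microformat-based extraction discussed above more concrete, the sketch below collects the contents of span elements whose class is inchi or smiles. It is written in Python with the standard-library HTML parser, whereas the userscripts themselves use XPath queries in JavaScript; the names `MicroformatParser` and `extract_identifiers` are hypothetical, and nested spans are deliberately not handled.

```python
from html.parser import HTMLParser


class MicroformatParser(HTMLParser):
    """Collect text inside <span class="inchi"> and <span class="smiles">."""

    KINDS = ("inchi", "smiles")

    def __init__(self):
        super().__init__()
        self.found = []      # list of (kind, identifier) tuples
        self._stack = []     # one entry per open <span>: its kind or None
        self._buffer = []    # text pieces of the span being collected

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        kind = next((k for k in self.KINDS if k in classes), None)
        self._stack.append(kind)
        if kind:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits directly inside a recognised span.
        if self._stack and self._stack[-1]:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        kind = self._stack.pop()
        if kind:
            self.found.append((kind, "".join(self._buffer).strip()))


def extract_identifiers(html):
    """Return all (kind, identifier) pairs marked up in an HTML fragment."""
    parser = MicroformatParser()
    parser.feed(html)
    parser.close()
    return parser.found
```

Because the class attribute states what each string is, no pattern matching on the identifier itself is needed; this is the advantage over the regular-expression approach required for PDB IDs.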
Authors' contributions
NMOB, ELW, HG and DJ have written userscripts mentioned in this text. RG developed and maintains the 3D structure database and contributed to the development of the Pub3D userscript. DW and CS devised and tested some of the userscripts. All authors have read and approved the final manuscript.

Acknowledgements
Pedro Beltrão is acknowledged for laying the foundations of the userscript that adds blog comments to journal web pages. We thank the anonymous reviewers for their constructive comments.

References
1. Galperin MY: The Molecular Biology Database Collection: 2007 update. Nucleic Acids Res 2007, 35:D3-D4.
2. Fox JA, McMillan S, Ouellette BFF: A compilation of molecular biology web servers: 2006 update on the Bioinformatics Links Directory. Nucleic Acids Res 2006, 34:W3-W5.
3. Postgenomic [http://postgenomic.com/]
4. Chemical blogspace [http://cb.openmolecules.net/]
5. Good BM, Kawas EA, Kuo BY, Wilkinson MD: iHOPerator: User-scripting a personalized bioinformatics Web, starting with the iHOP website. BMC Bioinformatics 2006, 7:534.
6. Etzold T, Ulyanov A, Argos P: SRS: information retrieval system for molecular biology data banks. Method Enzymol 1996, 266:114-128.
7. Lee T, Pouliot Y, Wagner V, Gupta P, Calvert DS, Tenenbaum J, Karp P: BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics 2006, 7:170.
8. ChemSpider [http://chemspider.com/]
9. Karthikeyan M, Krishnan S, Pandey AK, Bender A: Harvesting Chemical Information from the Internet Using a Distributed Approach: ChemXtreme. J Chem Inf Model 2006, 46:452-461.
10. BioSpider [http://biospider.ca/]
11. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart D: BioSpider: A Web Server for Automating Metabolome Annotations. Pacific Symp Biocomp 2007, 12:145-156.
12. Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Brief Bioinform 2002, 3(4):331-341.
13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
14. IUPAC International Chemical Identifier (InChI) [http://www.iupac.org/inchi/]
15. Greasemonkey [http://www.greasespot.net/]
16. Creammonkey [http://creammonkey.sourceforge.net/]
17. IE7pro [http://www.ie7pro.com/]
18. Turnabout [http://www.reifysoft.com/turnabout.php]
19. Coles SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the chemical semantic web through the use of InChI identifiers. Org Biomol Chem 2005, 3(10):1832-1834.
20. Microformats [http://microformats.org/]
21. hCard Microformat [http://microformats.org/wiki/hcard]
22. hCalendar Microformat [http://microformats.org/wiki/hcalendar]
23. XML Path Language (XPath) 2.0 - W3C Recommendation [http://www.w3.org/TR/2007/REC-xpath20-20070123/]
24. JSON [http://json.org/]
25. Townsend JA, Adams SE, Waudby CA, de Souza VK, Goodman JM, Murray-Rust P: Chemical documents: machine understanding and automated information extraction. Org Biomol Chem 2004, 2:3294-3300.
26. Corbett P, Murray-Rust P: High-throughput identification of chemistry in life science texts. In Computational Life Sciences II Volume 4216. Edited by: Berthold MR, Glen R, Fischer I. Berlin/Heidelberg: Springer-Verlag; 2006:107-118.
27. RSC Project Prospect [http://www.rsc.org/Publishing/Journals/ProjectProspect/]
28. Wikipedia [http://www.wikipedia.org/]
29. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: An open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
30. Connotea [http://www.connotea.org/]
31. pubmed2connotea 2006 [http://lindenb.integragen.org/pubmed2connotea/].
32. PubChem [http://pubchem.ncbi.nlm.nih.gov/]
33. Ertl P, Rohde B, Selzer P: Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment Based Contributions and Its Application to the Prediction of Drug Transport Properties. J Med Chem 2000, 43:3714-3717.
34. Dong X, Gilbert KE, Guha R, Heiland R, Kim J, Pierce ME, Fox GC, Wild DJ: Web Service Infrastructure for Chemoinformatics. J Chem Inf Model 2007, 47:1303-1307.
35. Agrafiotis DK: Stochastic Proximity Embedding. J Comp Chem 2003, 24:1215-1221.
36. Pub3D [http://rguha.ath.cx/~rguha/cicc/p3d/]
37. Jmol: an open-source Java viewer for chemical structures in 3D [http://www.jmol.org]
38. Willighagen E, Howard M: Fast and Scriptable Molecular Graphics in Web Browsers without Java3D. Available from Nature Precedings 2007 [http://dx.doi.org/10.1038/npre.2007.50.1].
39. Herráez A: Biomolecules in the computer: Jmol to the rescue. Biochem Mol Biol Edu 2006, 34:255-261.
40. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.
41. FirstGlance in Jmol [http://firstglance.jmol.org/]
42. Mardia KV, Nyirongo VB, Green PJ, Gold ND, Westhead DR: Bayesian refinement of protein functional site matching. BMC Bioinformatics 2007, 8:257.
43. eMolecules [http://emolecules.com/]
44. Chemical blogspace - Chemical Compounds [http://cb.openmolecules.net/inchis.php]
45. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The Blue Obelisk-interoperability in chemical informatics. J Chem Inf Model 2006, 46:991-998.
46. Blue Obelisk Userscripts [http://blueobelisk.sourceforge.net/wiki/Userscripts]
47. Chemistry Central Journal web page for Majumder et al [http://journal.chemistrycentral.com/content/1/1/10]
48. Majumder AB, Shah S, Gupta MN: Enantioselective transacetylation of (R,S)-β-citronellol by propanol rinsed immobilized Rhizomucor miehei lipase. Chem Cent J 2007, 1:10.
49. BMC Bioinformatics web page for Spjuth et al [http://www.biomedcentral.com/1471-2105/8/59]
50. PubChem page for aspirin [http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244]
51. BMC Bioinformatics web page for Mardia et al [http://www.biomedcentral.com/1471-2105/8/257]
52. Counting constitutional isomers from the molecular formula [http://chem-bla-ics.blogspot.com/2006/12/counting-stereoisomers-from-molecular_17.html]
53. PubChem page for methane [http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=297]
54. Google [http://google.com/]
O’Boyle et al. Journal of Cheminformatics 2011, 3:8
http://www.jcheminf.com/content/3/1/8

SOFTWARE Open Access

Confab - Systematic generation of diverse low-energy conformers
Noel M O’Boyle1,2*, Tim Vandermeersch2, Christopher J Flynn1, Anita R Maguire1 and Geoffrey R Hutchison2,3

* Correspondence: baoilleach@gmail.com
1 Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland
Full list of author information is available at the end of the article

© 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. J. Cheminf. 2011, 3, 8.

Abstract
Background: Many computational chemistry analyses require the generation of conformers, either on-the-fly, or in advance. We present Confab, an open source command-line application for the systematic generation of low-energy conformers according to a diversity criterion.
Results: Confab generates conformations using the ‘torsion driving approach’ which involves iterating systematically through a set of allowed torsion angles for each rotatable bond. Energy is assessed using the MMFF94 forcefield. Diversity is measured using the heavy-atom root-mean-square deviation (RMSD) relative to conformers already stored. We investigated the recovery of crystal structures for a dataset of 1000 ligands from the Protein Data Bank with fewer than 1 million conformations. Confab can recover 97% of the molecules to within 1.5 Å at a diversity level of 1.5 Å and an energy cutoff of 50 kcal/mol.
Conclusions: Confab is available from http://confab.googlecode.com.

Introduction
The generation of molecular conformations is an essential part of many computational analyses in chemistry, particularly in the field of computational drug design. Methods such as 3D QSAR, protein-ligand docking and pharmacophore generation and searching [1] all require the generation of conformers, whether on-the-fly (as part of the method) or pre-generated by a stand-alone conformer generator. In contrast to 3D structure generators (such as CORINA [2], DG-AMMOS [3] and smi23d [4]), which focus on the generation of a single low-energy conformation, conformation generators create an ensemble of conformers that cover the entire space of low-energy conformations or that part of conformational space occupied by biologically-relevant conformers.

Several proprietary conformation generators are currently available (including OMEGA [5], ROTATE [6], Catalyst [7], Confort [8], ConfGen [9], Balloon [10] and MED-3DMC [11] among others) but only recently have open source conformation generators appeared: Frog2 [12] generates conformers using a Monte Carlo approach, while Multiconf-DOCK [13] adapts the systematic search code from DOCK5 [14] to generate diverse conformers via a torsion-driving approach.

Confab 1.0 is the first release of Confab, an open source conformation generator whose goal is the systematic coverage of conformational space. Accuracy has been favoured over the introduction of approximations to improve performance. The algorithm starts with an input 3D structure which, after some initialisation steps, is used to generate multiple conformers which are filtered on-the-fly to identify diverse low energy conformers. Conformations are generated using the torsion-driving approach from a set of predefined allowed torsion angles. Ring conformations are not currently sampled.

The first section of the paper describes the algorithm used by the software and some implementation details. After this, two applications of the software are described: an analysis of the conformational space of a dataset of 1000 molecules (which includes a comparison to Multiconf-DOCK), and an investigation of the conformational preferences of a particular phenyl sulfone.

Methods
Algorithm
The Confab algorithm is outlined in Figure 1. The input required is a 3D structure with reasonable bond lengths and angles. Since the algorithm does not currently
  • 84. O’Boyle et al. Journal of Cheminformatics 2011, 3:8 Page 2 of 9 http://www.jcheminf.com/content/3/1/8
explore ring conformations, any rings present should be in reasonable conformations.
Figure 1 Flowchart depicting the Confab algorithm.
The first step of the algorithm is the identification of rotatable bonds. These are defined as all acyclic single bonds where both atoms of the bond are connected to at least two non-hydrogen atoms, but neither atom of the bond is sp-hybridised. Note that this definition excludes rotation around bonds that interchange hydrogens (for example, the rotation of the hydrogens of a methyl group), but this does not imply any loss of accuracy as it is usual practice to exclude hydrogens when calculating the RMSD (see below).
The method used by Confab to generate conformations is known as the torsion-driving approach. A set of allowed torsion angles is assigned to each rotatable bond by searching for a match to predefined SMARTS strings in a user-configurable file (torlib.txt) included in the Confab distribution. This file is part of the Open Babel project and it assigns values to particular rotatable bonds using data from Huang et al. [15].
Once the allowed torsion angles are assigned, they are corrected for topological (that is, graph) symmetry. The presence of such symmetry allows performance to be improved by eliminating redundant evaluations, thus reducing the number of conformations that will be tested. 2-fold symmetry is identified when a rotatable bond involves an sp2 hybridised carbon atom where the neighbouring two atoms affected by the rotation are both of the same symmetry class. When this occurs the allowed values of that torsion are halved by restricting them to those less than 180°. The same is done for the case of 3-fold symmetry at an sp3 hybridised carbon where the three neighbours are of the same symmetry class; in this case the torsion angles are restricted to those less than 120°. If graph symmetry is identified at both ends of a rotatable bond, the result is multiplicative; a 2-fold and a 3-fold symmetry combine to restrict allowed values of the torsion angles to 360/6 = 60°.
The next step is to obtain an estimate of the energy of the most stable conformer. Throughout Confab, energies are calculated using the MMFF94 forcefield [16]. The values of the bond stretching, angle bend, stretch bend and out-of-plane bending terms are constant for all conformers of the same molecule; only the torsion, van der Waals and electrostatic terms are repeatedly evaluated. A low energy conformer is found using a simple greedy algorithm. Each torsion angle is optimised starting with the most central torsion and proceeding outwards. As this procedure is relatively fast (compared to the combinatorial problem of searching for the global optimum) it is repeated up to 16 times by testing the four most central torsions in different orders. The lowest energy conformer found is used as a reference point for applying an energy cutoff during the conformer search. If, during the actual conformer generation, a lower energy conformer is found, this lower energy is used instead for the reference from that point on.
The main part of the algorithm is the systematic generation and assessment of all conformers described by the allowed torsion angles. Confab generates each of these in turn up to a user-specified cutoff (the default is 10^6) and determines its energy relative to the lowest energy conformer found so far. If this is within a user-specified energy cutoff (50 kcal/mol by default), it is assessed for diversity to the conformers already stored (see below). If it is found to be diverse, it is itself stored; otherwise it is discarded. The algorithm then moves onto the next conformer.
Rather than iterate in a ‘depth-first’ manner over the torsions and their allowed angles, Confab uses a Linear Feedback Shift Register (LFSR) to iterate in a random order over all of the conformers. An LFSR allows the generation of all integers from 1 to N pseudorandomly without repetition and without any memory overhead (which is important for large values of N). By iterating randomly, Confab avoids biasing generated conformers towards a particular region of conformational space, for example towards the input conformation.
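The pseudorandom iteration over conformer indices can be illustrated with a short Python sketch. This is not Confab's C++ code: the function name `lfsr_order`, the choice of a Galois-form register and the small table of toggle masks are assumptions made for illustration (Confab takes its tap values from ref [25], and the table here covers only some register widths). The idea is the one described in the text: a maximal-length LFSR of width w visits every integer from 1 to 2^w - 1 exactly once, so skipping values greater than N yields 1 to N in a scrambled order while storing nothing but the current state.

```python
# Toggle masks for Galois LFSRs of a few widths (bit t-1 is set for tap t).
# Illustrative subset only; each mask corresponds to a primitive polynomial.
_MASKS = {
    2: 0x3, 3: 0x6, 4: 0xC, 5: 0x14, 6: 0x30, 7: 0x60, 8: 0xB8,
    9: 0x110, 10: 0x240, 11: 0x500, 15: 0x6000, 16: 0xB400, 20: 0x90000,
}


def lfsr_order(n):
    """Yield every integer from 1 to n exactly once, in a scrambled order."""
    widths = [w for w in _MASKS if (1 << w) - 1 >= n]
    if n < 1 or not widths:
        raise ValueError("n must be between 1 and 2**20 - 1 in this sketch")
    mask = _MASKS[min(widths)]   # smallest tabulated register that covers n
    state = start = 1            # any non-zero start state would do
    while True:
        if state <= n:           # values above n are simply skipped
            yield state
        lsb = state & 1          # one Galois LFSR step: shift right and
        state >>= 1              # toggle the tap bits when a 1 falls out
        if lsb:
            state ^= mask
        if state == start:       # the register has completed its full cycle
            break
```

For example, `list(lfsr_order(10))` gives `[1, 6, 3, 10, 5, 7, 9, 8, 4, 2]`. Turning such an index into one torsion angle per rotatable bond is a separate decoding step that is not sketched here.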
  • 85. It also helps increase diversity if the number of possible conformations is greater than the cutoff for the number tested.
Diversity is ensured by calculating the heavy-atom RMSD (after least-squares alignment) of the newly generated conformation to those previously stored. The alignment is carried out using the QCP algorithm of Theobald [17] (which we found to be about twice as fast as the popular Kabsch alignment method [18]). Despite this, when a molecule has many conformers and a large number of conformers have been stored, full pairwise RMSD calculations take an excessive amount of time. To minimise the number of RMSD evaluations required to discard a conformer, chosen conformers are stored in a tree structure that effectively clusters conformers on-the-fly by RMSD. Figure 2(a) shows a typical ‘diversity tree’ where each level of the tree is associated with a smaller RMSD diversity from 3.0 Å down to the cutoff specified by the user (1.6 Å in the figure). Each node of the tree represents a stored conformation. Sibling nodes (that is, nodes at the same level that share the same parent node) differ by at least the RMSD diversity associated with that level. Note that sibling nodes are ordered and that the first child node of each parent is the same as the parent itself.
To illustrate the algorithm, let us imagine adding a new conformation H to the tree depicted in Figure 2(a). The algorithm starts at the top of the tree and determines which of the two branches (A or B) to take at the 3.0 Å diversity level. To do so it checks whether H is within 3.0 Å RMSD of A. If so, it follows the tree down to the next level, and checks to see whether it is within 2.0 Å RMSD of A (note that it does not need to recalculate the RMSD to do this). If this is not true, then it checks for 2.0 Å similarity to C. If so, it follows C down to the next level; otherwise it checks against D. If it is not similar to D, H is stored in the tree as the next sibling at that level of the tree (this is depicted in Figure 2(b)). When adding a new node for a conformation at a particular level, if the level is not at the bottom then child nodes containing that conformation are added at successively lower levels until the bottom level is reached. Overall, there are two possibilities; either the algorithm reaches the bottom level and finds that the new conformation is within the RMSD cutoff of an existing conformer, in which case it is discarded, or else it is of sufficient diversity to be stored at some level of the tree.
Figure 2 An example diversity tree used to filter conformations on-the-fly. (a) A diversity tree containing five conformations (A to E) used to filter conformations with an RMSD of less than 1.6 Å to one of the stored conformations. (b) The same diversity tree after addition of conformer H, where H is within 3.0 Å of A but not within 2.0 Å of A, C or D.
This algorithm greatly reduces the number of RMSD evaluations during the conformer generation loop. However, it does not eliminate all conformations that are similar to those already stored; conformations may be retained that differ by less than the RMSD cutoff if they end up in different branches. To prune the set of retained conformations, while still avoiding a computationally expensive pairwise RMSD calculation, all of the retained conformations are added one-by-one to a new tree in order of increasing energy. This time the algorithm used for adding conformations to the diversity tree is more robust: all sibling conformations are tested for similarity, even after finding one that is similar. The result is that the same conformation may be added at several different points in the tree. This makes the tree more effective at eliminating similar conformations at the expense of a greater number of RMSD calculations.
The RMSD can be overestimated when a molecule’s structure has automorphisms (an automorphism is a permutation of the atoms of a molecule that preserves the bond connections). For example, if you consider a para-substituted phenyl ring where two conformations differ by a rotation of 180° around the substituted carbons, it is clear that the calculated RMSD between the conformations should be 0. However, if the symmetry of the phenyl ring is not taken into account this will not be the case and the RMSD will be overestimated as the corresponding atoms of the two structures have moved. The symmetry-corrected RMSD is obtained by iterating over the automorphisms of the molecule and taking the minimum value of the resulting RMSDs.
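The diversity tree and the symmetry-corrected RMSD described on this page can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Confab implementation: the names `Node`, `add_conformer` and `symmetry_corrected_rmsd` are hypothetical, the RMSD is computed without the QCP alignment step, the automorphisms are passed in by hand rather than detected, and distances are recomputed on the way down the tree instead of being reused. Only the first, on-the-fly insertion mode is shown (the search stops at the first similar sibling).

```python
import math


def rmsd(a, b):
    """Plain RMSD between two equally ordered coordinate lists (no alignment)."""
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def symmetry_corrected_rmsd(a, b, automorphisms):
    """Minimum RMSD over the given atom permutations of b."""
    return min(rmsd(a, [b[i] for i in perm]) for perm in automorphisms)


class Node:
    def __init__(self, conf=None):
        self.conf = conf       # stored conformation (None for the root)
        self.children = []     # ordered siblings one level further down


def add_conformer(root, conf, levels, dist):
    """Return True if conf was stored in the tree, False if it was discarded.

    levels holds the RMSD diversity of each tree level, largest first,
    ending with the user-specified cutoff.
    """
    node = root
    for depth, threshold in enumerate(levels):
        for child in node.children:
            if dist(conf, child.conf) < threshold:
                node = child   # similar at this level: follow this branch
                break
        else:
            # Diverse at this level: add a new sibling here, plus copies of
            # it at every lower level down to the bottom of the tree.
            for _ in levels[depth:]:
                new = Node(conf)
                node.children.append(new)
                node = new
            return True
    return False               # within the final cutoff of a stored conformer
```

As a toy usage example, take one-dimensional "conformations" with `dist = lambda x, y: abs(x - y)` and `levels = [3.0, 2.0, 1.0]`. Adding 0.0, 0.5, 1.5, 5.0 and 2.2 in turn stores every value except 0.5. The value 2.2 is kept even though it lies within 1.0 of 1.5, because the two end up in different branches; this is the kind of case that the second, more thorough pass over the retained conformations is meant to catch.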
  • 86. For performance reasons, the calculation of the RMSD is not symmetry-corrected during the main conformation generation loop. However, the symmetry correction is applied afterwards when building the final diversity tree, thereby eliminating any conformations that were retained in error.
Implementation
Confab is essentially a modified version of Open Babel [19], a widely-used cheminformatics toolkit written in C++ and available under the open source GPL v2 licence [20]. In fact, some of the code written for Confab has been merged into the main Open Babel distribution (such as the original Kabsch alignment code) but due to an additional dependency (on tree.hh, see below) the core code has not been included in Open Babel v2.3. The MMFF94 forcefield, the conformer generation framework and the automorphism detection are all provided by Open Babel. QCP alignment was implemented using Theobald’s public domain code [21] in combination with the Eigen2 high performance linear algebra library [22]. The diversity analysis code relies on a tree data structure provided by the Open Source tree.hh library [23]. The code used to implement the Linear Feedback Shift Register (LFSR) was adapted from its corresponding Wikipedia article [24]. Tap values for the register were taken from Alfke’s Xilinx application note [25].
The Confab distribution contains two command-line applications: confab and calcrmsd. The former implements the Confab algorithm to generate conformers given an input 3D structure, while the latter may be used to assess the performance of confab by comparing the generated conformers to a file containing crystal structures. Full details of these applications are available on the Confab website.
Coverage of Conformational Space
Dataset
To illustrate the performance of Confab, we used a dataset of 1000 small molecule crystal structures derived from that of Borodina et al. [26]. The original source is the PDB; thus this dataset represents bioactive conformations of molecules. The 3D structures of the 14504 ligands in the Borodina dataset were obtained using the PubChem Download Service (using the PubChem Substance IDs from Borodina et al.). Of these, 16 could not be handled by the MMFF94 forcefield, 5202 had no rotatable bonds (this fraction included a large number of trivial salts) and 2348 had more than 1 million conformers (according to Confab’s torsion rules). 1000 structures were randomly chosen from the 6938 remaining. See Additional file 1 for the structures of these 1000 molecules.
To avoid bias towards the crystal structures, the input conformations for Confab were generated by building the 3D structure using Open Babel. After the initial structure generation, the structures were optimised using the MMFF94 forcefield (200 steps steepest descent). Since Confab does not explore ring conformations, ring conformations were taken from the crystal structure for the initial structure generation. See Additional file 2 for the generated structures.
Results
Figure 3(b) shows an overview of the dataset of 1000 structures in terms of the number of rotatable bonds in each molecule. Although the dataset contains molecules with up to 12 rotatable bonds, it is clear by comparison with the full dataset of Borodina et al. in Figure 3(a) that the reduced dataset is only a representative sample for molecules having up to 7 rotatable bonds. Beyond this, the restriction that the molecule must have fewer than 1 million conformers leads to the elimination of most of the molecules.
Figure 3 The distribution of molecules in terms of the number of rotatable bonds in (a) the dataset of Borodina et al., and (b) our dataset of 1000 molecules.
  • 87. For this reason, to avoid erroneous conclusions some of the following analyses (where stated) will not consider molecules having 8 or more rotatable bonds.
Confab was used to exhaustively generate all low energy conformers for each molecule in the dataset for diversity values ranging from 0.4 Å to 3.0 Å RMSD. The default setting of 50 kcal/mol was used as an energy cutoff. The default value of 1 million conformers was used as the conformer cutoff; this ensured exhaustive coverage of conformational space (as defined by Confab’s torsion rules) as structures with more conformers were not included in the dataset (see above). Figure 4 shows the mean time for conformer generation per molecule. This is largely independent of the diversity level for diversity levels greater than or equal to 1.0 Å. For values less than this, an increasing amount of time is spent performing the pairwise RMSD calculations against stored conformations.
Figure 4 Effect of diversity level on speed of conformer generation. Times were measured on an Intel Xeon E5620 Processor (2.4GHz, 4C) with 32GB RAM.
Performance of conformer generators is typically measured by the percent recovery of crystal structures with respect to a particular RMSD cutoff (see for example Ref [9]). This is simply the percentage of molecules which have a generated conformer within a particular RMSD of the crystal structure. Commonly used values for this RMSD cutoff are 2.0, 1.5 and 1.0 Å. Figure 5(a) shows the percent recovery at these cutoffs for different values of the RMSD diversity. At 2.0 Å RMSD diversity, 99% are within 2.0 Å RMSD of the crystal (83% within 1.5, 41% within 1.0); at 1.5 Å RMSD diversity, 99% are within 2.0 Å (97% within 1.5, 50% within 1.0); at 1.0 Å RMSD diversity, 99% are within 2.0 Å RMSD (98% within 1.5, 89% within 1.0). As expected, the percentage of crystal structures that are found decreases as the RMSD diversity increases. In particular, the curves fall off steeply once the RMSD diversity is greater than the required cutoff.
Figure 5 Performance measured as % recovery of crystal structures. (a) Performance for different RMSD cutoffs. The diversity cutoff is where the value of the RMSD diversity is used as the RMSD cutoff. (b) The RMSD cutoff required to achieve a particular level of % recovery. The diagonal line indicates the maximum RMSD cutoff expected when there is complete coverage.
An interesting question to ask is what RMSD diversity is required to recover X% of crystal structures with respect to a particular RMSD cutoff? Figure 5(b) shows the answer to this where X is 90%, 95% or 98%. For example, to find 95% of the crystal structures within a 2.0 Å cutoff an RMSD diversity of 2.4 Å (or smaller) is required, but to find the same percentage to within 1.5 Å an RMSD diversity of 1.6 Å is needed. However, even an RMSD diversity of 0.4 Å will not recover 98%
  • 88. of the structures to within 1.0 Å (it only recovers 96%), an indication of the inherent diversity of the generated conformers as discussed further below.
As pointed out by Borodina et al. [26], if the conformational space is perfectly covered and lacks any ‘holes’ then the RMSD diversity is an upper bound of the minimum RMSD to the crystal structure. In other words, at an RMSD diversity of 1.5 Å for example, all crystal structures should be found to within 1.5 Å. The diagonal line in Figure 5(b) indicates the maximum RMSD cutoff expected if this ideal behaviour is observed. It is clear from the figure that at low RMSD diversity the actual performance is poorer than this.
There are two main problems that give rise to gaps in conformational coverage. The first is that the allowed torsion values may not encompass the specific torsion angle observed in the crystal structure. For this dataset, there are 7 molecules for which the crystal structure could not be found within 2.0 Å even at 0.4 Å RMSD diversity. These molecules (PubChem substance IDs of 584680, 823881, 825747, 826196, 828032, 830919 and 834618), of which two represent different conformations of the same molecule, all involve sugar moieties and it may be that the allowed torsion angles of the glycosidic bond are too conservative.
The second is that the granularity of the allowed torsion settings may not be sufficiently fine to allow solutions to be found to within a low RMSD cutoff. For example, a carbon-carbon single bond has 12 allowed torsion values from 0 to 360° in increments of 30°. If such a bond is centrally located in a large molecule, even if the crystal structure has similar torsion angles to one of these conformers the RMSD may differ significantly.
Based on this dataset, the inherent granularity of the Confab generated conformers is around 1.4 Å, as indicated by the “Diversity Cutoff” line in Figure 5(a) which falls off sharply as the RMSD diversity decreases below 1.4 Å. This line indicates the percent recovery at different levels of RMSD diversity when the RMSD cutoff used is the same as the diversity level. The sharp fall off below 1.4 Å is a deviation from the ideal behaviour described by Borodina et al.
Table 1 shows the median number of generated conformers tested for molecules with different numbers of rotatable bonds. Broadly speaking, about one third of the conformers pass the energy cutoff applied. Although the size of each individual subset is not very large, and the values for 6 rotatable bonds seem to be biased towards a larger number of conformers, some general points can still be made.
The number of diverse conformers is much reduced by a higher diversity level. For example, for those molecules with 7 rotatable bonds there are approximately 11000 low energy conformers of which about 13% are diverse at 0.5 Å RMSD, only 1.3% are diverse at 1.0 Å RMSD, and only 0.16% are diverse at 1.5 Å RMSD.
The values in Table 1 are in broad agreement with those reported by Smellie et al. [27] for a representative subset of their dataset (see table three therein). They make the point that the number of conformers required to cover conformational space is really surprisingly low. For a molecule with 7 rotatable bonds in our dataset, conformational space can be covered to within 1.0 Å with merely hundreds of conformations while just tens of conformations will achieve a coverage of 1.5 Å. Of course, these figures are expected to increase with each additional rotatable bond.
For completeness, Table 1 also reports median values for the minimum RMSD to the crystal structure. However, as a metric for coverage these values give a misleadingly positive picture compared to the percent recovery values discussed above.
Table 1 Relationship between the number of rotatable bonds, the number of conformers generated and the minimum RMSD to the crystal structure
Rotatable  Number of  Total Conformers  Low Energy Conformers  Diverse Conformers (median)          Minimum RMSD to crystal (median)
bonds†     molecules  (median)          (median)               0.5 Å  1.0 Å  1.5 Å  2.0 Å  3.0 Å    0.5 Å  1.0 Å  1.5 Å  2.0 Å  3.0 Å
1          214        3                 3                      3      1      1      1      1        0.18   0.40   0.45   0.45   0.45
2          97         36                25                     8      2      1      1      1        0.34   0.54   0.74   0.80   0.80
3          216        72                44                     19     4      1      1      1        0.39   0.70   1.02   1.06   1.06
4          143        1296              582                    96     9      2      1      1        0.52   0.80   1.07   1.14   1.24
5          86         3024              1065                   189    24     4      1      1        0.60   0.82   1.14   1.31   1.34
6          114        186624            24317                  2953   192    24     5      1        0.71   0.90   1.21   1.49   1.78
7          69         34992             10679                  1402   139    17     4      1        0.66   0.83   1.14   1.44   1.73
† The 61 molecules with 8 or more rotatable bonds are omitted.
Comparison with Multiconf-DOCK
Multiconf-DOCK [13] is another open source conformer generator that uses a torsion driving approach to implement a systematic search to identify diverse low energy
  • 89. conformers. This software differs in that it uses the AMBER force field [28,29] (as implemented in DOCK5) instead of MMFF94. In addition, it implements performance improvements such as search tree pruning by partial energy estimation [14]. Like Confab, the software requires a 3D structure as input.
Multiconf-DOCK was used to generate conformations for the 1000 structures in the dataset using the same input as for Confab but converted to MOL2 using Open Babel v2.3.0. It should be noted that the specified Sybyl atom types in the input MOL2 file have an effect on the conformations generated by Multiconf-DOCK. The parameters used were taken from the example provided with the Multiconf-DOCK distribution, except that no restriction was placed on the number of generated conformations and the energy cutoff was set to 50 kcal/mol (as used for Confab). Three different RMSD diversity levels were investigated: 2.0 Å, 1.5 Å and 1.0 Å. For all three diversity levels, the mean time spent per molecule was 6.3 s (measured on the same machine used for Figure 4).
The performance in terms of percent recovery is as follows: at 2.0 Å RMSD diversity, 99% are within 2.0 Å RMSD of the crystal structure (89% within 1.5, 55% within 1.0); at 1.5 Å RMSD diversity, 99% are within 2.0 Å (97% within 1.5, 64% within 1.0); at 1.0 Å RMSD diversity, 99% are within 2.0 Å (98% within 1.5, 80% within 1.0). These values are broadly similar to those for Confab (see above). The most noticeable differences occur for the percentage of structures found to within 1.0 Å RMSD; assuming that both programs successfully remove conformations that are within the diversity cutoff, Multiconf-DOCK outperforms Confab at the 2.0 Å and 1.5 Å RMSD diversity levels but Confab performs better at 1.0 Å RMSD diversity.
Table 2 shows the median number of conformers generated by Multiconf-DOCK, along with the minimum RMSD to the crystal structure, broken down by the number of rotatable bonds. Compared to Confab, far fewer conformers are generated. It is difficult to say whether this represents a less comprehensive coverage of conformational space or whether this is due to the use of different forcefields. In terms of the minimum RMSD to the crystal structure, once again we see that Multiconf-DOCK performs better than Confab at the 2.0 Å and 1.5 Å RMSD diversity levels but Confab is better at 1.0 Å RMSD diversity.
Table 2 Results for Multiconf-DOCK showing the relationship between the number of rotatable bonds, the number of conformers generated and the minimum RMSD to the crystal structure
Rotatable  Diverse Conformers (median)  Minimum RMSD to crystal (median)
bonds      1.0 Å  1.5 Å  2.0 Å          1.0 Å  1.5 Å  2.0 Å
1          1      1      1              0.34   0.40   0.40
2          3      1      1              0.50   0.67   0.71
3          2      1      1              0.68   0.78   0.81
4          9      3      1              0.76   0.97   1.05
5          14     4      2              0.85   1.03   1.28
6          43     15     5              1.08   1.23   1.37
7          21     8      3              1.04   1.24   1.40
Distance Distribution in Conformations of a Phenyl Sulfone
Many conformer generators are focused on reproducing bioactive conformations. However, it is worth remembering that the generation of conformers may also be useful in other contexts. Here we use Confab as an aid to interpret the NMR spectra for the phenyl sulfone shown in Figure 6. The peak for the methylene carbon of the ethyl ester was split unexpectedly (compared to an analogous sulfone where the phenyl group was replaced by tert-butyl), and our hypothesis was that this was due to the close approach of the methylene carbon to one of the sulfonyl oxygens in solution. Confab was used to investigate whether low energy conformations existed where the methylene group was in close proximity to a sulfonyl oxygen.
Confab was used to generate a set of conformations of the molecule with a diversity of 0.2 Å and no energy cutoff. The resulting 2014 conformations were optimised using the MMFF94 forcefield (200 steps steepest descent; implemented using Pybel [30]) and the final energy recorded. For each of the conformations the minimum distance between a sulfonyl oxygen and the methylene carbon was measured.
Figure 7 shows a plot of these distances versus the relative energies of the conformers with marginal histograms showing the distribution of values. The methylene carbon does not approach the sulfonyl group very closely. For low energy conformers, the distances are clustered around 4.0 Å and 5.4 Å with the former more frequent. Taking 5 kcal/mol as a cutoff, the distance can be as low as 3.7 Å but shorter distances (down to 3.0 Å)
  • 90. are only possible with an associated energy penalty. Figure 6 shows one of the low energy conformations (relative energy of 4.6 kcal/mol) which has a distance of 3.7 Å between the groups of interest.
Figure 6 Structure of the phenyl sulfone studied.
Figure 7 Scatterplot with marginal histograms of distance versus energy for the set of conformations of the phenyl sulfone in Figure 6.
Conclusion
The goal of this first release of Confab is to ensure complete coverage of all of the low energy conformers of a molecule. While every effort is made to maximise performance, accuracy has been the main goal. Approximations that reduce the search space on the basis of heuristics have been avoided for this reason. Using the results from Confab 1.0 as a comparison, future work will investigate strategies to overcome the combinatorial explosion associated with large numbers of rotatable bonds [31] including the trade-off between speed and accuracy.
Availability and Requirements
Project name: Confab
Project home page: http://confab.googlecode.com
Operating system(s): Cross-platform
Programming language: C++
Other requirements (if compiling): CMake 2.4+, Eigen2
Licence: GPL v2
Any restrictions to use by non-academics: None
Additional material
Additional file 1: Crystal structures used to test conformational coverage. This is a text file in SDF format containing biological conformations (as downloaded from PubChem) of 1000 molecules. This is a subset of the data used in the study by Borodina et al.
Additional file 2: Generated 3D structures used to test conformational coverage. This is a text file in SDF format containing 3D structures of the 1000 molecules in the dataset generated using Open Babel. These were used as the input to Confab.
Acknowledgements and Funding
NMOB is supported by a Health Research Board Career Development Fellowship, PD/2009/13. We thank several beta testers for their valuable feedback, and the anonymous reviewers for their constructive comments.
Author details
1 Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland. 2 Open Babel development team. 3 Department of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA.
Authors’ contributions
NMOB devised and implemented Confab, and carried out the coverage analysis. GRH implemented the conformer generation framework in Open Babel and contributed to the forcefield code. TV implemented the automorphism code in Open Babel and contributed to the forcefield code. NMOB collaborated with CJF and ARM on the sulfone investigation. All authors read and approved the final manuscript.
Received: 9 February 2011 Accepted: 16 March 2011 Published: 16 March 2011
References
1. Schwab CH: Conformations and 3D pharmacophore searching. Drug Discov Today Tech 2010, 7:e245-e253.
2. Sadowski J, Gasteiger J, Klebe G: Comparison of Automatic Three-Dimensional Model Builders Using 639 X-Ray Structures. J Chem Inf Comput Sci 1994, 34(4):1000-1008.
3. Lagorce D, Pencheva T, Villoutreix BO, Miteva MA: DG-AMMOS: A New tool to generate 3D conformation of small molecules using Distance Geometry and Automated Molecular Mechanics Optimization for in silico Screening. BMC Chem Biol 2009, 9:6.
  • 91.
4. Gilbert K, Guha R: smi23d. [http://www.chembiogrid.org/cheminfo/smi23d/].
5. Hawkins PCD, Skillman AG, Warren GL, Ellingson BA, Stahl MT: Conformer Generation with OMEGA: Algorithm and Validation using High Quality Structures from the Protein Databank and Cambridge Structural Database. J Chem Inf Model 2010, 50:572-584.
6. Renner S, Schwab CH, Gasteiger J, Schneider G: Impact of Conformational Flexibility on Three-Dimensional Similarity Searching Using Correlation Vectors. J Chem Inf Model 2006, 46:2324-2332.
7. Catalyst. Accelrys Inc: San Diego, CA; [http://accelrys.com/].
8. Confort. Tripos Inc: St Louis, MO; [http://www.tripos.com/].
9. Watts KS, Dalal P, Murphy RB, Sherman W, Friesner RA, Shelley JC: ConfGen: A Conformational Search Method for Efficient Generation of Bioactive Conformers. J Chem Inf Model 2010, 50:534-546.
10. Vainio MJ, Johnson MS: Generating Conformer Ensembles Using a Multiobjective Genetic Algorithm. J Chem Inf Model 2007, 47:2462-2474.
11. Sperandio O, Souaille M, Delfaud F, Miteva MA, Villoutreix BO: MED-3DMC: A new tool to generate 3D conformation ensembles of small molecules with a Monte Carlo sampling of the conformational space. Eur J Med Chem 2009, 44:1405-1409.
12. Miteva MA, Guyon F, Tufféry P: Frog2: Efficient 3D conformation ensemble generator for small compounds. Nucleic Acids Res 2010, 38:W622-W627.
13. Sauton N, Lagorce D, Villoutreix BO, Miteva MA: MS-DOCK: Accurate multiple conformation generator and rigid docking protocol for multi-step virtual ligand screening. BMC Bioinformatics 2008, 9:184.
14. Makino S, Kuntz ID: Automated flexible ligand docking method and its application for database search. J Comput Chem 1997, 18:1812-1825.
15. Huang N, Shoichet BK, Irwin JJ: Benchmarking Sets for Molecular Docking. J Med Chem 2006, 49:6789-6801.
16. Halgren TA: Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem 1996, 17:490-519.
17. Theobald DL: Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Cryst A 2005, 61:478-480.
18. Kabsch W: A solution for the best rotation to relate two sets of vectors. Acta Cryst A 1976, 32:922-923.
19. Hutchison GR, Morley C, Vandermeersch T, O’Boyle NM, James C, et al: Open Babel, v2.3. [http://openbabel.org].
20. Free Software Foundation: GNU General Public License, v2. [http://www.gnu.org/licenses/old-licenses/gpl-2.0.html].
21. Theobald DL: QCProt, v1.1. [http://theobald.brandeis.edu/qcp/].
22. Guennebaud G, Jacob B, et al: Eigen, v2.0.15. [http://eigen.tuxfamily.org].
23. Peeters K: tree.hh, v2.65. [http://tree.phi-sci.com/].
24. Linear feedback shift register. Wikipedia [http://en.wikipedia.org/wiki/Linear_feedback_shift_register], Retrieved Aug 11, 2010.
25. Alfke P: Efficient Shift Registers, LFSR Counters, and Long Pseudo-Random Sequence Generators. Xilinx application note 1996 [http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf], XAPP052.
26. Borodina YV, Bolton E, Fontaine F, Bryant SH: Assessment of Conformational Ensemble Sizes Necessary for Specific Resolutions of Coverage of Conformational Space. J Chem Inf Model 2007, 47:1428-1437.
27. Smellie A, Kahn SD, Teig SL: Analysis of Conformational Coverage. 1. Validation and Estimation of Coverage. J Chem Inf Comput Sci 1995, 35(2):285-294.
28. Weiner SJ, Kollman PA, Case DA, Singh UC, Ghio C, Alagona G, Profeta S Jr, Weiner P: A new force field for molecular mechanical simulation of nucleic acids and proteins. J Am Chem Soc 1984, 106:765-784.
29. Weiner SJ, Kollman PA, Nguyen DT, Case DA: An all atom force field for simulations of proteins and nucleic acids. J Comput Chem 1986, 7:230-252.
30. O’Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
31. Beusen DD, Shands EFB, Karasek SF, Marshall GR, Dammkoehler RA: Systematic search in conformational analysis. J Mol Struct THEOCHEM 1996, 370:157-171.
doi:10.1186/1758-2946-3-8
Cite this article as: O’Boyle et al.: Confab - Systematic generation of diverse low-energy conformers. Journal of Cheminformatics 2011 3:8.
  • 93. O’Boyle Journal of Cheminformatics 2011, 3:10 http://www.jcheminf.com/content/3/1/10
BOOK REPORT Open Access
Review of “Data Analysis with Open Source Tools” by Philipp K Janert
Noel M O’Boyle
Correspondence: baoilleach@gmail.com
Analytical and Biological Chemistry Research Facility, University College Cork, Western Road, Cork, Co. Cork, Ireland

Book details
Janert PK: Data Analysis with Open Source Tools Sebastopol, CA: O’Reilly Media 2010

Cheminformatics has been defined as the application of informatics methods to solve chemical problems [1]. Such chemical problems are often represented in terms of data, be it activity data for a series of compounds or descriptor values for a compound library. While this new book from the O’Reilly stable is not aimed specifically at cheminformaticians, the subtitle of “A Hands-On Guide for Programmers and Data Scientists” makes it clear that the target audience includes any scientists whose day-to-day work involves analysing and interpreting data.

The book is broadly divided into four parts on Graphics: Looking at Data, Analytics: Modeling Data, Computation: Mining Data and Applications: Using Data. First of all, it should be noted that this is not a book about statistics (as Chapter 1 states explicitly). Neither is it a manual for numpy, Sage, matplotlib, Gnuplot, R and so forth, as might be implied by the title. Instead, Janert focuses on discussing data analysis methods and techniques in depth, rather than skimming topics by following a cookbook or tutorial approach linked to particular software. This is as it should be - there are already documentation and manuals available for all of these programs, and the reader is simply alerted to the availability of the software, its capabilities are described and some examples of use shown.

This is a real practitioner’s book. Janert, a former physicist and software engineer, is a consultant in data analysis and mathematical modelling. He has taken his hard-won knowledge and tried to get it all down on paper for the reader’s benefit. For example, in a chapter with the provocative title of “What you really need to know about classical statistics” he explains why introductory statistics textbooks seem to cover methods and topics at odds with the problems data analysts deal with day-to-day; essentially classical methods were developed at a time of small and expensive datasets and no computational power, and hypothesis testing focused on determining whether an effect existed. Today we have ample computing power and may be dealing with very large datasets; also, we are usually more interested in the size of an effect (practical significance) rather than just whether it exists (statistical significance).

Topics that could not be squeezed into a chapter proper have been placed in shorter “Intermezzos” at the end of each section. For example, a short section on “What about map/reduce?” at the end of “Mining Data” reminds the reader that the map/reduce methodology (much hyped recently) is not a clever algorithm to speed things up, but rather a piece of infrastructure that makes it convenient to implement algorithms that are trivially parallelisable.

On the negative side, any cheminformatician who has been involved with QSAR studies will already be familiar with the multivariate analysis methods discussed here (Chapters 13 and 14), although I liked the observation that “you will actually spend more time on data sets that are totally worthless” in relation to clustering algorithms. Also there are two chapters (out of 19) which will be of little interest as they focus on business intelligence and financial calculations, although even there the reader will find an introduction to the use of Berkeley DB and SQLite from Python, tools which I highly recommend. There are also cases where the author perhaps gives too much detail, but this is hardly a criticism - in a book of some 500 pages there is plenty of room.

Overall though, I heartily recommend this book to anyone working in cheminformatics whether they develop methods or apply them. Too often we rely on summary statistics such as mean and standard deviation and forget to actually look at the data. Graphical analysis gives you a feel for the data, and can often highlight problems, interesting features, or mistaken assumptions. After reading this book, you should be very aware of both the advantages and pitfalls of a wide variety of analysis methods but you will also be reminded that the goal of data analysis is not a picture or a number but insight.

© 2011 O’Boyle; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. J. Cheminf. 2011, 3, 10.
  • 94. Competing interests
The author declares that they have no competing interests.

Received: 8 March 2011 Accepted: 24 March 2011 Published: 24 March 2011

Reference
1. Gasteiger J: Introduction. In Chemoinformatics - A Textbook. Edited by: Gasteiger J, Engel T. Weinheim: Wiley-VCH; 2003:1-13.

doi:10.1186/1758-2946-3-10
Cite this article as: O’Boyle: Review of “Data Analysis with Open Source Tools” by Philipp K Janert. Journal of Cheminformatics 2011 3:10.
  • 95. O’Boyle et al. Journal of Cheminformatics 2011, 3:37 http://www.jcheminf.com/content/3/1/37
RESEARCH ARTICLE Open Access
Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on
Noel M O’Boyle1*, Rajarshi Guha2, Egon L Willighagen3, Samuel E Adams4, Jonathan Alvarsson5, Jean-Claude Bradley6, Igor V Filippov7, Robert M Hanson8, Marcus D Hanwell9, Geoffrey R Hutchison10, Craig A James11, Nina Jeliazkova12, Andrew SID Lang13, Karol M Langner14, David C Lonie15, Daniel M Lowe4, Jérôme Pansanel16, Dmitry Pavlov17, Ola Spjuth5, Christoph Steinbeck18, Adam L Tenderholt19, Kevin J Theisen20 and Peter Murray-Rust4

* Correspondence: baoilleach@gmail.com
1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, College Road, Cork, Co. Cork, Ireland
Full list of author information is available at the end of the article

Abstract
Background: The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.
Results: This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.
Conclusions: We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.

Background
The Blue Obelisk movement was established in 2005 at the 229th National Meeting of the American Chemical Society as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. While other scientific disciplines such as physics, biology and astronomy (to name a few) were embracing new ways of doing science and reaping the benefits of community efforts, there was little if any innovation in the field of chemistry and scientific progress was actively hampered by the lack of access to data and tools. Since 2005 it has become evident that a good amount of development in open chemical information is driven by the demands of neighbouring scientific fields. In many areas in biology, for example, the importance of small molecules and their interactions and reactions in biological systems has been realised. In fact, one of the first free and open databases and ontologies of small molecules was created as a resource about chemical structure and nomenclature by biologists [1].

The formation of the Blue Obelisk group is somewhat unusual in that it is not a funded network, nor does it follow the industry consortium model. Rather it is a grassroots organisation, catalysed by an initial core of interested scientists, but with membership open to all who share one or more of the goals of the group:

• Open Data in Chemistry. One can obtain all scientific data in the public domain when wanted and reuse it for whatever purpose.
• Open Standards in Chemistry. One can find visible community mechanisms for protocols and communicating information. The mechanisms for creating and maintaining these standards cover a wide spectrum of human organisations, including various degrees of consent.
• Open Source in Chemistry. One can use other people’s code without further permission, including changing it for one’s own use and distributing it again.

© 2011 O’Boyle et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. J. Cheminf. 2011, 3, 37.
  • 96. Note that while some may advocate also for Open Access to publications, the Blue Obelisk goals (ODOSOS) focus more on the availability of the underlying scientific data, standards (to exchange data), and code (to reproduce results). All three of these goals stem from the fundamental tenets of the scientific method for data sharing and reproducibility.

The Blue Obelisk was first described in the CDK News [2] and later as a formal paper by Guha et al. [3] in 2006. Its home on the web is at http://blueobelisk.org. This contribution looks back on the work carried out by the Blue Obelisk over the past 5 years in the areas of Open Data, Open Source, and Open Standards in chemistry.

Scope
The Blue Obelisk covers many areas of chemistry and chemical resources used by neighbouring disciplines (e.g. biochemistry, materials science). Many of the efforts relate to cheminformatics (the scope of this journal) and we believe that many of the publications in Journal of Cheminformatics could be completely carried out using Blue Obelisk resources and other Open Source chemical tools. The importance of this is that for the first time it would allow reviewers, editors and readers to validate assertions in the journal and also to re-run and re-analyse parts of the calculation.

However, Blue Obelisk software and data are also used outside cheminformatics and certainly in the five main areas that, for example, Chemical Markup Language (CML) [4] supports:

1. Molecules: This is probably the largest area for Blue Obelisk software and data, and is reflected by many programs that visualise, transform, convert formats and calculate properties. It is almost certain that any file format currently in use can be processed by Blue Obelisk software and that properties can be calculated for most (organic) compounds.
2. Reactions: Blue Obelisk software can describe the semantics of reactions and provide atom-atom matching and analyse stoichiometric balance in reactions.
3. Computational chemistry: Blue Obelisk software can interpret many of the current output files from calculations and create input for jobs. The Quixote project (see below and elsewhere in this issue) shows that Open Source approaches based on Blue Obelisk resources and principles are increasing the availability and re-usability of computational chemistry.
4. Spectra: 1-D spectra (NMR, IR, UV etc.) are fully supported in Blue Obelisk offerings for conversion and display. There is a limited amount of spectral analysis but the software gives a platform on which it should be straightforward to develop spectral annotation and manipulation. However, currently the Blue Obelisk lacks support for multi-dimensional NMR and multi-equipment spectra (e.g. GC-MS).
5. Crystallography: The Blue Obelisk software supports the bi-directional processing of crystal structure files (CIF) and also solid-state calculations such as plane-waves with periodic boundary conditions. There is considerable support for the visualisation of both periodic and aperiodic condensed objects.

Many of the current operations in installing and running chemical computations and using the data are integration and customisation rather than fundamental algorithms. It is very difficult to create universal platforms that can be distributed and run by a wide range of different users, and in general, the Blue Obelisk deliberately does not address these. Our approach is to produce components that can be embedded in many environments, from stand-alone applications to web applications, databases and workflows. We believe that a chemical laboratory with reasonable access to common software engineering techniques should be able to build customised applications using Blue Obelisk components and standard infrastructure such as workflows and databases. Where the Blue Obelisk itself produces data resources they are normally done with Open components so that the community can, if necessary, replicate them.

Much of the impetus behind Blue Obelisk software is to create an environment for chemical computation (including cheminformatics) where all of the components, data, specifications, semantics, ontology and software are Openly visible and discussable. The largest current uses by the general chemical community are in authoring, visualisation and cheminformatics calculations but we anticipate that this will shortly extend into mainstream computational chemistry and solid-state. Although many of the authors are employed as research scientists, there are also several people who contribute in their spare time and we anticipate an increasing value and use of the Blue Obelisk in education at all levels.

Open Source
The development of Open Source software has been one of the most successful of the Blue Obelisk’s activities. The following sections describe recent work in this area, and Table 1 provides an overview of the projects discussed and where to find them online.

Cheminformatics toolkits
Open Source toolkits for cheminformatics have now existed for nearly ten years. During this period, some toolkits were developed from scratch in academia,
  • 97. Table 1 Blue Obelisk Open Source Software projects discussed in the text

Name                                Website

CML Tools
CMLXOM                              https://bitbucket.org/wwmm/cmlxom/
JUMBO                               http://sourceforge.net/projects/cml/

Cheminformatics Toolkits
Chemistry Development Kit (CDK)     http://cdk.sf.net
Cinfony                             http://cinfony.googlecode.com
Indigo                              http://ggasoftware.com/opensource/indigo
JOELib                              http://sf.net/projects/joelib
Open Babel                          http://openbabel.org
RDKit                               http://rdkit.org

Web Applications
ChemDoodle Web Components           http://web.chemdoodle.com
Jmol                                http://jmol.org

Integration
Bioclipse                           http://www.bioclipse.net
CDK-Taverna                         http://cdktaverna.wordpress.com
Lensfield2                          https://bitbucket.org/sea36/lensfield2/

Interconversion
CIFXOM [95]                         https://bitbucket.org/wwmm/cifxom/
JUMBO-Converters                    https://bitbucket.org/wwmm/jumbo-converters/
OPSIN                               http://opsin.ch.cam.ac.uk
OSRA                                http://osra.sf.net

Structure Databases
Bingo                               http://ggasoftware.com/opensource/bingo
Chempound (Chem#)                   https://bitbucket.org/chempound
Mychem                              http://mychem.sf.net
OrChem                              http://orchem.sf.net
pgchem                              http://pgfoundry.org/projects/pgchem/

Text mining
ChemicalTagger [96]                 http://chemicaltagger.ch.cam.ac.uk/
OSCAR4                              https://bitbucket.org/wwmm/oscar4/

Computational Chemistry
Avogadro                            http://avogadro.openmolecules.net
cclib                               http://cclib.sf.net
GaussSum                            http://gausssum.sf.net
QMForge                             http://qmforge.sf.net

Computational Drug Design
Confab [97]                         http://confab.googlecode.com
Pharao                              http://silicos.be/download
Piramid                             http://silicos.be/download
Sieve                               http://silicos.be/download
Stripper                            http://silicos.be/download

Other Applications
AMBIT2                              http://ambit.sf.net
Brunn                               http://brunn.sf.net
Toxtree                             http://toxtree.sf.net
XtalOpt                             http://xtalopt.openmolecules.net
  • 98. whereas others were made Open Source by releasing in-house codebases under liberal licenses. When the Blue Obelisk was established five years ago, the primary toolkits under active development were the Chemistry Development Kit (CDK) [5,6], Open Babel [7], and JOELib [8]. Of these, both the CDK and Open Babel continue to be actively developed.

The CDK project has been under regular development over the last five years. Several features have been implemented ranging from core components such as an extensible SMARTS matching system and a new graph (and subgraph) isomorphism method [9], to more application oriented components such as 3D pharmacophore searching and matching, and a variety of structural-key and hashed fingerprints. In addition, there have been a number of second generation tools developed on top of the CDK (see below). As well as the use of the CDK in various tools, it has been deployed in the form of web services [10] and has formed the basis of a variety of web applications.

Since 2006, major new features of Open Babel include 3D structure generation and 2D structure-diagram generation, UFF and MMFF94 forcefields, and significantly expanded support for computational chemistry calculations. In addition, a major focus of Open Babel development has been to provide for accurate conversion and representation in areas of stereochemistry, kekulisation, and canonicalisation. The project has also grown, in terms of new contributors, new support from commercial companies, and second-generation tools applying Open Babel to a variety of end-user applications, from molecular editors to chemical database systems.

Two new Open Source cheminformatics toolkits have appeared since the original paper. In 2006 Rational Discovery, a cheminformatics service company (since closed down), released RDKit [11] under the BSD License. This is a C++ library with Python and (more recently) Java bindings. RDKit is actively developed and includes code donated by Novartis. Recent developments include the Java bindings, as well as performance improvements for its database cartridge.

More recently, GGA Software Services (a contract programming company) released the Indigo toolkit [12] and associated software in 2009 under the GPL. Indigo is a C++ library with high-level wrappers in C, Java, Python, and the .NET environment. Like RDKit and other toolkits, Indigo provides support for tetrahedral and cis-trans stereochemistry, 2D coordinate generation, exact/substructure/SMARTS matching, fingerprint generation, and canonical SMILES computation. It also provides some less common functionality, like matching tautomers and resonance substructures, enumeration of subgraphs, finding maximum common substructure of N input structures, and enumerating reaction products.

Second-generation tools
Although feature-rich and robust cheminformatics toolkits are useful in and of themselves, they can also be seen as providing a base layer on which additional tools and applications can be built. This is one of the reasons that cheminformatics toolkits are so important to the open source ‘ecosystem’; their availability lowers the barrier for the development of a ‘second generation’ of chemistry software that no longer needs to concern itself with the low-level details of manipulating chemical structures, and can focus on providing additional functionality and ease-of-use. Although a wide range of chemistry software has been built using Blue Obelisk components (see for example, the “Related Software” link on the Open Babel website, [13] listing over 40 projects as of this writing, or “Software using CDK” at the CDK website), in this section we focus on second-generation tools which themselves have been developed by members of the Blue Obelisk.

Bioclipse [14] (v2.4 released in Aug 2010) and Avogadro [15] (v1.0 in Oct 2009) are two examples of such software, based on the CDK and Open Babel, respectively. Bioclipse (Figure 1) is an award-winning molecular workbench for life sciences that wraps cheminformatics functionality behind user-friendly interfaces and graphical editors while Avogadro (Figure 2) is a 3D molecular editor and viewer aimed at preparing and analysing computational chemistry calculations. Both projects are designed to be extended or scripted by users through the provision of a plugin architecture and scripting support (using Bioclipse Scripting Language [16], or Python in the case of Avogadro). An interesting aspect of both Avogadro and Bioclipse is that they share some developers with the underlying toolkits and this has driven the development of new features in the CDK and Open Babel.

Both products in turn act as extensible platforms for other software. Bioclipse, for example is used by software such as Brunn [17], a laboratory information system for microplate based high-throughput screening. Brunn provides a graphical interface for handling different plate layouts and dilution series and can automatically generate dose response curves and calculate IC50-values. Avogadro is used by Kalzium [18], a periodic table and chemical editor in KDE, and XtalOpt [19,20], an evolutionary algorithm for crystal structure prediction. XtalOpt provides a graphical interface using Avogadro and submits calculations using a range of solid-state simulation software to predict stable polymorphs.

A final example of second-generation Blue Obelisk software is the AMBIT2 [21,22] software, which was developed to facilitate registration of chemicals for the REACH EU directive, and is based on the CDK. It was distributed initially as a standalone Java Swing GUI, and
  • 99. more recently as a downloadable web application archive, offering a web services interface to a searchable chemical structures database. Also integrated are descriptor calculations, as well as the ability to run and build predictive models, including modules of the open source Toxtree [22-24] software for toxicity prediction.

Figure 1 Screenshot of Bioclipse using Jmol to visualise a molecular surface.
Figure 2 Screenshot of Avogadro showing a depiction of a carbon nanotube.

Computational chemistry analysis
Another area where the Blue Obelisk has had a significant impact in the past five years is in supporting quantum chemistry calculations and in interpreting their results. Electronic structure calculations have a long tradition in the chemistry community and a variety of programs exist, mostly proprietary software but with an increasing number of open source codes. However, since each program uses different input formats, and the output formats vary widely (sometimes even varying between different versions of the same software), preparing calculations and automatically extracting the results is problematic.

Avogadro has already been mentioned as a GUI for preparing calculations. It uses Open Babel to read the output of several electronic structure packages. Avogadro generates input files on the fly in response to user input on forms, as well as allowing inline editing of the files before they are saved to disk. It also features intuitive syntax highlighting for GAMESS input files,
  • 100. allowing expert users to easily spot mistakes before saving an input file to disk.

In addition to this, significant development of new parsing routines took place in an Avogadro plugin to read in basis sets and electronic structure output in order to calculate molecular orbital and electron density grids. This code was written to be parallel, using desktop shared memory parallelism and high level APIs in order to significantly speed up analysis. Most of this code was recently separated from the plugin, and released as a BSD licensed library, OpenQube, which is now used by the latest version of Avogadro. Jmol (see below) can also depict computational chemistry results including molecular orbitals.

In 2006, the Blue Obelisk project cclib [25] was established with the goal of parsing the output from computational chemistry programs and presenting it in a standard way so that further analyses could be carried out independently of the quantum package used. cclib is a Python library, and the current version (version 1.0.1) supports 8 different computational chemistry codes and extracts over 30 different calculated attributes. Two related Blue Obelisk projects build upon cclib. GaussSum [26] is a GUI that can monitor the progress of SCF and geometry convergences, and can plot predicted UV/Vis absorption and infrared spectra from appropriate logfiles containing energies and oscillator strengths for easy comparison to experimental data. QMForge [27] provides a GUI for various electronic structure analyses such as Frenking’s charge decomposition analysis [28] and Mulliken or C-squared analyses on user-defined molecular fragments. QMForge also provides a rudimentary Cartesian coordinate editor allowing molecular structures to be saved via Open Babel.

The Quixote project epitomises the full use of the Blue Obelisk software and is described in detail in another article in this issue. Here we observe that it is possible to convert legacy chemistry file formats of all sorts into semantic chemistry and extract those parts which are suitable for input to computational chemistry programs. This chemistry is then combined with generic concepts of computational chemistry (e.g. strategy, machine resources, timing, accuracy etc.) into the legacy inputs for a wide range of programs. Quixote itself follows Blue Obelisk principles in that it does not manage the submission and monitoring of jobs but resumes action when the jobs have been completed, and then applies a range of parsing and transformation tools to create standardised semantic chemical content. A major feature of Quixote is that it requires all concepts to validate against dictionaries and the process of parsing files necessarily generates communally-agreed dictionaries, which represent an important step forward in the Open specifications for Blue Obelisk. When widely-deployed, Quixote will advertise the value of Open community standards for semantics to the world.

The Quixote project is not dependent on any particular technology, other than the representation of computational chemistry in CML and the management of semantics through CML dictionaries. At present, we use JUMBO-Converters [29] for most of the semantic conversion, Lensfield2 [30] for the workflow and Chempound (chem#) [31] to store and disseminate the results.

Web applications
While desktop software has composed the majority of scientific tools since the computer was introduced, the internet continues to change how applications and content are distributed and presented. The web presents new opportunities for scientists as it is an open and free medium to distribute scientific knowledge, ideas and education. Web applications are software that runs within the browser, typically implemented in Java or JavaScript. Recently, a new version of the HTML specification, HTML5, defined a well-developed framework for creating native web applications in JavaScript and this opens up new possibilities for visualising chemical data.

Jmol, the interactive 3D molecular viewer, is one of the most widely used chemistry applets, and indeed has seen widespread use in other fields such as biology and even mathematics (it is used for 3D depiction of mathematical functions in the Sage Mathematics Projects [32]). It is implemented in Java, and has gone from being a “Rasmol/Chime” replacement to a fully fledged molecular visualisation package, including full support for crystallography [33], display of molecular orbitals from standard basis set/coefficient data, the inclusion of dynamic minimisation using the UFF force field, and a full implementation of Daylight SMILES and SMARTS, with extensions to conformational and biomolecular substructure searching (Jmol BioSMARTS).

In 2009, iChemLabs released the ChemDoodle Web Components library [34] under the GPL v3 license (with a liberal HTML exception). This library is completely implemented in JavaScript and uses HTML5 to allow the scientist to present publication quality 2D and 3D graphics (see Figure 3) and animations for chemical structures, reactions and spectra. Beyond graphics, this tool provides a framework for user interaction to create dynamic applications through web browsers, desktop platforms and mobile devices such as the iPhone, iPad and Android devices.

The business end
Open Source provides a unique opportunity for commercial organisations to work with the cheminformatics community. Traditional business models rely on monetisation of source code, causing companies to repeat work
  • 101. done by other companies. This model is sometimes combined with a free (gratis) model for people working at academic institutes, to increase adoption and encourage contributions from academics. This solution defines the return on investment as the IP on the software, but has the downside of investment losses due to duplication of software and method development, which become visible when proprietary companies merge.

Figure 3 Screenshot of the MolGrabber 3D demo from ChemDoodle Web Components.

Some authors have argued that in the chemistry field few contributors are available to volunteer time to improve codes and IP considerations may prevent contributions from industry [35]. If true, this would hamper adoption of Open Source and Open Data in chemistry, and greatly slow the growth of projects such as those in the Blue Obelisk.

The Blue Obelisk community, however, takes advantage of the fact that much of the investment needed for development is either paid for by academic institutes and funding schemes, or by volunteers investing time and effort. In return, contributors get full access to the source code, and the Open Source licensing ensures that they will have access any time in the future. In this way, the license functions as a social contract between everyone to arrange an immediate return on investment. Effectively, this approach shares the burden of the high investment in having to develop cheminformatics software from scratch, allowing researchers and commercial partners alike to focus on their core business, rather than the development of prerequisites. In the case of the Blue Obelisk, the rich collection of Open Source cheminformatics tools provided greatly reduces investment up front for new companies in the cheminformatics market. Such advantages have also been noted in the drug discovery field [36-38].

The use of Open Standards allows everyone to select those Blue Obelisk components they find most useful, as they can easily replace one component with another providing the same functionality, taking advantage of the fact that they use the same standards for, for example, data exchange. This way, licensing issues are becoming a marginal problem, allowing companies to select a license appropriate for their business model. This, too, allows a company to create a successful product with significantly reduced cost and effort.

At the time of writing there are many commercial companies developing chemistry solutions around Open Source cheminformatics components provided by the Blue Obelisk community. Examples of such companies include iChemLabs, IdeaConsult, Wingu, Silicos, GenettaSoft, eMolecules, hBar, Metamolecular, and Inkspot Science. Some of these merely use components, but several actively contribute back to the Blue Obelisk project they use, or donate new Open Source cheminformatics projects to the community.

For example, iChemLabs released the ChemDoodle Web Components library under the GPL v3 license, based on the upcoming HTML5 Open Standard. It allows making web and mobile interfaces for chemical content. The project is already being adopted by others, including iBabel [39], ChemSpotlight [40] and the RSC ChemSpider [41,42].

Silicos has released several Open Source utilities [43] based on Open Babel, such as Pharao, a tool for pharmacophore searching, Sieve for filtering molecular structure by molecular property, Stripper for removing core scaffold structures from a molecule set, and Piramid for molecular alignment using shape determined by the Gaussian volumes as a descriptor. Additionally, contributions have been made to the Open Babel project itself.

Other companies use Blue Obelisk components and contribute patches, smaller and larger. For example, IXELIS donated the isomorphism code in the CDK, eMolecules donated canonicalisation code to Open Babel, Metamolecular improved the extensibility and unit testing suite of OPSIN, and AstraZeneca contributed code to the CDK for signatures. This is just a very minor selection, and the reader is encouraged to contact the individual Blue Obelisk projects for a detailed list.

In May 2011, a Wellcome Trust Workshop on Molecular Informatics Open Source Software (MIOSS) explored the role of Open Source in industrial laboratories and companies as well as academia (several of the presenters are among the authors of this paper). The meeting identified that Open Source software was extremely valuable to industry not just because it is available for free, but because it allows the validation of source code, data and computational procedures. Some of the discussion was on business models or other ways to maintain development of Open Source software on which a business relied. Companies are concerned about training and support and, in some cases, product liability. There are difficulties for software for which there is
  • 102. no formal transaction other than downloading and agreeing to license terms. One anecdote concerned a company that wished to donate money to an Open Source project but could not find a mechanism to do so. Industry participants also pointed out that there is a considerable amount of contribution-in-kind from industry, both from enhancements to software and also the development of completely new software and toolkits. Companies are now finding it easier to create mechanisms for releasing Open Source software without violating confidentiality or incurring liability. A phrase from the meeting summed it up: “The ice is beginning to melt”, signifying that we can expect a rapid increase in industry’s interest in Open Source.

Converting chemical names and images to structures

The majority of chemical information is not stored in machine-readable formats, but rather as chemical names or depictions. The OSRA and OPSIN projects focus on extracting chemical information from these sources. Such software plays a particularly important role for data mining the chemical literature, including patents and theses.

Optical Structure Recognition Application (OSRA) [44] was started in early 2007 with the goal to create the first free and open source tool for extraction and conversion of molecular images into SMILES and SD files. From the very beginning the underlying philosophy was to integrate existing open source libraries and to avoid “reinventing the wheel” wherever possible. OSRA relies on a variety of open source components: Open Babel for chemical format conversion and molecular property calculations, GraphicsMagick for image manipulation, Potrace for vectorisation, GOCR and OCRAD for optical character recognition. The growing importance of image recognition technology can be seen in the fact that only a few years ago there was only one widely available software package for chemical structure recognition - CLiDE (commercially developed at Keymodule, Ltd), but today there are as many as seven available programs.

OPSIN (Open Parser for Systematic IUPAC Nomenclature) [45] focuses instead on interpreting chemical names. The chemical name is the oldest form of communication used to describe chemicals, predating even the knowledge of the atomic structure of compounds. Chemical names are abundant in the scientific literature and encode valuable structural information. Through successive books of recommendations [46,47], IUPAC has tried to codify and to an extent standardise naming practices. OPSIN aims to make this abundance of chemical names machine readable by translating them to SMILES, CML or InChI. The program is based around the use of a regular grammar to guide tokenisation and parsing of chemical names, followed by step-wise application of nomenclature rules. It is able to offer fast and precise conversions for the majority of names using IUPAC organic nomenclature, and is available as a web service, Java library and standalone application for maximum interoperability.

Chemical database software

Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. A number of structure registration systems have been published in the last five years, exploiting the fact that Open Source cheminformatics toolkits such as Open Babel and the CDK are available. OrChem [48], for example, is an open source extension for the Oracle 11G database that adds registration and indexing of chemical structures to support fast substructure and similarity searching. The cheminformatics functionality is provided by the CDK. OrChem provides similarity searching with response times in the order of seconds for databases with millions of compounds, depending on a given similarity cut-off. For substructure searching, it can make use of multiple processor cores on today’s powerful database servers to provide fast response times in equally large data sets.

Besides the traditional and proven relational database approach with added chemical features (‘cartridges’), there is growing interest in tools and approaches based on the web philosophy and practice. Several groups [49,50] are experimenting with the Resource Description Framework (RDF) language on the assumption that generic high-performance solutions will appear. RDF allows everything to be described by URIs (data, molecules, dictionaries, relations). The Chempound system [31], as deployed in Quixote and elsewhere, is an RDF-based approach to chemical structures and compounds and their properties. For small to medium-sized collections (such as an individual’s calculations or literature retrieval), there are many RDF tools (e.g. SIMILE, Apache Jena) which can operate in machine memory and provide the flexibility that RDF offers. For larger systems, it is unclear whether complete RDF solutions (e.g. Virtuoso) will be satisfactory or whether a hybrid system based on name-value pairs (e.g. CouchDB, MongoDB) will be sufficient.

Collaboration and interoperability

One of the successes of the Blue Obelisk has been to bring developers together from different Open Source chemistry projects so that they look for opportunities to collaborate rather than compete, and to leverage work done by other projects to avoid duplication of effort. As an example of this, when in March 2008 the Jmol development team were looking to add support for energy
  • 103. minimisation, rather than implement a forcefield from scratch they ported the UFF forcefield [51] implementation from Open Babel to Jmol. This code enables Jmol to support 2D to 3D conversion of structures (through energy minimisation). In a similar manner, efficient Jmol code for atom-atom rebonding has been ported to the CDK. Figure 4 shows the collaborative nature of software developed in the Blue Obelisk, as one project builds on functionality provided by another project.

Figure 4 Dependency diagram of some Blue Obelisk projects. Each block represents a project. Square blocks show Open Data, ovals are Open Source, and diamonds are Open Standards.

Another collaborative initiative between Blue Obelisk projects was the establishment in May 2008 of the ChemiSQL project. This brought together the developers of several open source chemistry database cartridges (PgChem [52], Mychem [53], OrChem [48] and more recently Bingo [54]) with a view to making their database APIs more similar and collaborating on benchmark datasets for assessing performance. For two of these projects, PgChem and Mychem, which are both based on Open Babel, there is the additional possibility of working together on a shared codebase.

In the area of cheminformatics toolkits, two of the existing toolkits, Open Babel and RDKit, are planning to work together on a common underlying framework called MolCore [55]. This project is still in the planning stage, but if it is a success it will mean not only that the two libraries will be interoperable (while retaining their existing focus) but also that the cost of maintaining the code will be shared among more developers, freeing time for the development of new features.

One of the goals of the Blue Obelisk is to promote interoperability in chemical informatics. When barriers exist to moving chemical data between different software, the community becomes fragmented and there is the danger of vendor lock-in (where users are constrained to using a particular software, a situation which puts them at a disadvantage). This applies as much to Open Source software as to proprietary software. Cinfony is a project (first release in May 2008) whose goal is to tackle this problem in the area of cheminformatics toolkits [56]. It is a Python library that enables Open Babel, the CDK, and RDKit (and shortly, Indigo and OPSIN) to be used through the same API; this makes it easy, for example, to read a molecule using Open Babel, calculate descriptors using the CDK and create a depiction using RDKit.

Another way through which interoperability of Blue Obelisk projects has been promoted and developed is through integration into workflow software such as Taverna [57] and KNIME [58] (both open source). Such software makes it easy to automate recurring tasks, and to combine analyses or data from a variety of different software and web services. A combination of the Chemistry Development Kit and Taverna, for instance, was reported in 2010 [59]. In the case of KNIME, it comes with a built-in basic collection of CDK-based and Open Babel-based nodes, while other nodes for the RDKit and Indigo are available from KNIME’s “Community Updates” site.

Open Standards

Chemical Markup Language, CML

Chemical Markup Language (CML) is discussed in several articles in this issue, and a brief summary here reiterates that it is designed primarily to create a validatable semantic representation for chemical objects. The five main areas (molecules, reactions, computational
  • 104. chemistry, spectra and solid-state (see above)) have now all been extensively deployed and tested. CML can therefore be used as a reference for input and output for Blue Obelisk software and a means of representing data in Blue Obelisk resources.

CML, being an XML application, can inter-operate with other markup languages and in particular XHTML, SVG, MathML, docx and more specialised applications such as UnitsML and GML (geosciences). We believe that it would be possible using these languages to encode large parts of, say, first-year chemistry textbooks in XML. Similarly, it is possible to create compound documents with word processing or spreadsheet software that have inter-operating text, graphics and chemistry (as in Chem4Word). Being a markup language, CML is designed for re-purposing, including styling, and therefore a mixture of these languages can be used for chemical catalogues, general publications, logbooks and many other types of document in the scientific process.

CML describes much of its semantics through conventions and dictionaries, and the emerging ecosystem (especially in computational chemistry) is available as a semantic resource for many of the applications and specifications in this article.

InChI

The IUPAC InChI identifier is a non-proprietary and unique identifier for chemical substances designed to enable linking of diverse data compilations. Prior to the development of the InChI identifier, chemical information systems and databases used a wide variety of (generally proprietary) identifiers, greatly limiting their interoperability. Although its development predates the Blue Obelisk, software such as Open Babel has included InChI support since 2005, and support for InChI in Indigo is due in 2011.

Since the official InChI implementation is in C, it is difficult to access from the other widely used language for cheminformatics toolkits, Java. Early attempts to generate InChI identifiers from within Java involved programmatically launching the InChI executable and capturing the output, an approach that was found to be fairly unreliable and broke the ‘write once, run anywhere’ philosophy of Java. The Blue Obelisk project JNI-InChI [60] was established in 2006 to solve this problem by using the Java Native Interface framework to provide transparent access to the InChI library from within Java and other Java Virtual Machine (JVM) based languages, supporting the wider adoption of this standard identifier by the chemistry community.

The Java Native Interface framework provides a mechanism for code running inside the JVM to place calls to libraries written in languages such as C, C++ and Fortran, and compiled into native, machine-specific code. JNI-InChI provides a thin C wrapper, with corresponding Java code, around the IUPAC InChI library, exposing the InChI library’s functionality to the JVM. To overcome the need to have the correct InChI library pre-installed on a system, JNI-InChI comes with a variety of precompiled native binaries and automatically extracts and deploys the correct one for the detected operating system and architecture. The JNI-InChI library comes with native binaries supporting a range of operating systems and architectures; the current version has binaries for 32- and 64-bit Windows, Linux and Solaris, 64-bit FreeBSD and 64-bit Intel-based Mac OS X - a number of which are not supported by the original IUPAC distribution of InChI. The JNI-InChI project has matured to support the full range of functionality of the InChI C library: structure-to-InChI, InChI-to-structure, AuxInfo-to-structure, InChIKey generation, and InChI and InChIKey validation. JNI-InChI provides the InChI functionality for a number of Open Source projects, including the Chemistry Development Kit, Bioclipse and CMLXOM/JUMBO, and is also used by commercial applications and internally in a number of companies. Through its widespread use and Open Source development model, a number of issues in earlier versions of the software have been identified and resolved, and JNI-InChI now offers a robust tool for working with InChIs in the JVM.

OpenSMILES

One of the most widely used ways to store chemical structures is the SMILES format (or SMILES string). This is a linear notation developed by Daylight Information Systems that describes the connection table of a molecule and may optionally encode chirality. Its popularity stems from the fact that it is a compact representation of the chemical structure that is human readable and writable, and is convenient to manipulate (e.g. to include in spreadsheets, or copy from a web page). Despite its widespread use, a formal definition of the language did not exist beyond Daylight’s SMILES Theory Manual and tutorials. This caused some confusion in the implementation and interpretation of corner cases, for example the handling of cis/trans bond symbols at ring closures. In 2007, Craig James (eMolecules) initiated work on the OpenSMILES specification [61], a complete specification of the SMILES language as an Open Standard developed through a community process. The specification is largely complete and contains guidelines on reading SMILES, a formal grammar, recommendations on standard forms when writing SMILES, as well as proposed extensions.
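As a rough illustration of what a formal grammar for SMILES makes possible, the following minimal sketch splits a SMILES string into atoms, bonds, branches and ring-closure labels. It is a simplification under stated assumptions: the token classes are a hand-picked subset rather than the OpenSMILES grammar itself, the names `SMILES_TOKEN` and `tokenise_smiles` are hypothetical, and no checking of valence, ring-closure pairing or stereochemistry is attempted.

```python
import re

# Simplified token classes for a subset of SMILES. This is an assumption
# made for illustration; the OpenSMILES grammar is considerably richer.
SMILES_TOKEN = re.compile(
    r"""
    (?P<bracket_atom>\[[^\]]+\])          # bracket atoms such as [NH4+]
  | (?P<organic_atom>Br|Cl|[BCNOPSFI])    # organic-subset atoms
  | (?P<aromatic_atom>[bcnops])           # aromatic atoms
  | (?P<bond>[-=\#$:/\\])                 # bonds, including cis/trans symbols
  | (?P<ring_closure>%\d{2}|\d)           # ring-closure labels
  | (?P<branch>[()])                      # branch open and close
  | (?P<dot>\.)                           # disconnected components
    """,
    re.VERBOSE,
)


def tokenise_smiles(smiles):
    """Return a list of (kind, text) tokens; raise ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = SMILES_TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"Unrecognised character at position {pos}: {smiles[pos]!r}"
            )
        # lastgroup is the name of the alternative that matched
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    # trans-1,2-difluoroethene: the two '/' tokens are the cis/trans bond symbols
    print([text for _, text in tokenise_smiles("F/C=C/F")])
    # ['F', '/', 'C', '=', 'C', '/', 'F']
```

A tokeniser of this kind only covers the lexical layer; deciding how a `/` or `\` symbol next to a ring-closure label should be interpreted is exactly the sort of corner case that needs a written specification.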
  • 105. QSAR-ML

The field of QSAR has long been hampered by the lack of open standards, which makes it difficult to share and reproduce descriptor calculations and analyses. QSAR-ML was recently proposed as an open standard for exchanging QSAR datasets [62]. A dataset in QSAR-ML includes the chemical structures (preferably described in CML) with InChI to protect integrity, chemical descriptors linked to the Blue Obelisk Descriptor Ontology [63], response values, units, and versioned descriptor implementations to allow descriptors from different software to be integrated into the same calculation. Hence, a dataset described in QSAR-ML is completely reproducible. To allow for easy setup of QSAR-ML compliant datasets, a plugin for Bioclipse was created with a graphical interface for setting up QSAR datasets and performing calculations. Descriptor implementations are available from the CDK and JOELib, as well as via remote web services such as XMPP [64].

Remaining challenges

A core requirement for chemical structure databases and chemical registration systems in general is the notion of structure standardisation. That is, for a given input structure, multiple representations should be converted to one canonical form. Structure canonicalisation routines partially address this aspect, converting multiple alternative topologies to a single canonical form. However, the problem of standardisation is broader than just topological canonicalisation. Features that must be considered include

• topological canonicalisation
• handling of charges
• tautomer enumeration and canonicalisation
• normalisation of functional groups

Currently, most of the individual components of a ‘standardisation pipeline’ can be implemented using Blue Obelisk tools. The larger problem is that there is no agreed-upon list of steps for a standardisation process. While some specifications have been published (e.g., PubChem) and some standardisation services and tools are available (for example, PubChem provides an online service to standardise molecules [65]), each group has its own set of rules. A common reference specification for standardisation would be of immense value in interoperability between structure repositories as well as between toolkits (though the latter is still confounded by differences in lower level cheminformatic features such as aromaticity models).

We have already discussed the development of an Open SMILES standard. While much progress has been made towards a complete specification, more remains to be done before this can be considered finished. After that point, the next logical step would be to start work on a standard for the SMARTS language, the extension to SMILES that specifies patterns that match chemical substructures.

Open Data

A considerable stumbling block in advocating the release of scientific data as Open Data has been how exactly to define “Open.” A major step forward was the launch in 2010 of the Panton Principles for Open Data in Science [66]. This formalises the idea that Open Data maximises the possibility of reuse and repurposing, the fundamental basis of how science works. These principles recommend that published data be licensed explicitly, and preferably under CC0 (Creative Commons ‘No Rights Reserved’, also known as CCZero) [67]. This license allows others to use the data for any purpose whatsoever without any barriers. Other licenses compatible with the Panton Principles include the Open Data Commons Public Domain Dedication and Licence (PDDL), the Open Data Commons Attribution License, and the Open Data Commons Open Database License (ODbL) [68].

Table 2 Open Data in chemistry.
Name               License/Waiver  Description
Chempedia [98]     CC0             Crowd-sourced chemical names (project discontinued but data still available)
CrystalEye         PDDL            Crystal structures from primary literature
ONS Solubility     CC0             Solubility data for various solvents
Reaction Attempts  CC0             Data on successful and unsuccessful reactions
Overview of major open chemical data available under a license or waiver compatible with the Panton Principles.

Despite this positive news, little chemical data compatible with these principles has become available from the traditional chemistry fields of organic, inorganic, and solid state chemistry. Table 2 lists a few notable exceptions, some of which are discussed further below. There is also data available using licenses not compatible with the Panton Principles, but where the user is allowed to modify and redistribute the data. A new data set in this category is the data from the ChEMBL database [69], which is available under the Creative Commons Share-Alike Attribution license. The RSC ChemSpider database [41], although not fully Open, also hosts Open
  • 106. Data; for example, spectral data when deposited can be marked as Open.

Importantly, publishing data as CC0 is becoming easier now that websites are becoming available to simplify publishing data. Two projects that can be mentioned in this context are FigShare [70], where the data behind unpublished figures can be hosted, and Dryad [71], where data behind publications can be hosted. Initiatives like these make it possible to host small amounts of data, and those combined are expected to soon become a substantial knowledge base.

Reaction Attempts

Although there are existing databases that allow for searching reactions, those using Open Data are harder to find. The Reaction Attempts database [72], to which anyone can submit reaction attempts data, consists mainly of reaction information abstracted from Open Notebooks in organic chemistry, such as the UsefulChem project from the Bradley group [73] and the notebooks from the Todd group [74]. Key information from each experiment is abstracted manually, with the only required information consisting of the ChemSpider IDs of the reactants and the product targeted in the experiment, and a link to the laboratory notebook page. Information in the database can be searched and accessed using the web-based Reaction Attempts Explorer [75].

Since the database reflects all data from the notebooks, it includes experiments in progress, ambiguous results and failed runs. Unlike most reaction databases that only identify experiments successfully reported in the literature, the Reaction Attempts Explorer allows researchers to easily find patterns in reactions that have already been performed, and since the data are open and results are reported across all research groups, intersections are easily discovered and possible Open Collaboration opportunities are easily found [76,77].

Non-Aqueous Solubility

Although the aqueous solubility of many common organic compounds is generally available, quantitative reports of non-aqueous solubility are more difficult to find. Such information can be valuable for selecting solvents for reactions, re-crystallization and related processes. In 2008, the Open Notebook Science Solubility Challenge was launched for the purpose of measuring non-aqueous solubility of organic compounds, reporting all the details of the experiments in an Open Notebook and recording the results as Open Data in a centralized database [78,79]. This crowdsourcing project was also supported by Submeta, Sigma-Aldrich, Nature Publishing Group and the Royal Society of Chemistry. The database currently holds 1932 total measurements and 1428 averaged solute/solvent measurements, all of which are available under a CC0 license. Several web services and feeds are available to filter and re-use the dataset [80]. In particular, models have been developed for the prediction of non-aqueous solubility in 72 different solvents [81] using the method of Abraham et al. [82] with descriptors calculated by the Chemistry Development Kit. These models are available online and will be refined as more solubility data is collected.

The Blue Obelisk Data Repository (BODR)

The Blue Obelisk has created a repository of key chemical data in a machine-readable format [83]. The BODR focuses on data that is commonly required for chemistry software, and where there is a need to ensure that values are standard between codes. Examples are atomic masses and conversions between physical constants. These data can be used by others for any purpose (for example, for entry into Wikipedia or use in in-house software), and should lead to an enhancement in the quality of community reference data. The Blue Obelisk also provides a complementary project, the Chemical Structure Repository [83]. It aims to provide 3D coordinates, InChIs and several physico-chemical descriptors for a set of 570 organic compounds.

NMRShiftDB

NMRShiftDB [84,85] represents one of the earliest resources for Open community-contributed data (first released in 2003). Research groups that measure NMR spectra or extract them from the literature can contribute that information to NMRShiftDB, which provides an Open resource where entries can be searched by chemical structure or properties (especially peaks). Although it is difficult to encourage large amounts of altruistic contribution (as happens with Wikipedia), an alternative possible source of data could come from linking data capture with data publication. For example, the Blue Obelisk has enough software that it is possible to create a seamless chain for converting NMR structures in-house into NMRShiftDB entries. If and when the chemistry community encourages or requires semantic publication of spectra rather than PDFs, it would be possible to populate NMRShiftDB rapidly along the lines of CrystalEye (see below). A similar approach has been demonstrated earlier using the Blue Obelisk components Oscar and Bioclipse using text mining approaches [86].

CrystalEye

CrystalEye [87] is an example of cost-effective extraction of data from the literature where this is published both Openly and semantically. Software extracts Openly-published crystal structures from a variety of scholarly journals, processes them and then makes them available through a web interface. It currently contains
  • 107. about 250,000 structures. CrystalEye serves as a model for a high-value, high-quality Open data resource, including the licensing of each component as Panton-compatible Open data.

Other areas of activity

While each Blue Obelisk project has its own website and point of contact (typically a mailing list), because of the breadth of Blue Obelisk projects it can be difficult for a newcomer to understand which of them, if any, can best address a particular problem. To address this issue, members of the Blue Obelisk established a Question & Answer website [88] (see Figure 5). This is a website in the style of Stack Overflow [89] that encourages high-quality answers (and questions) through the use of a voting system. In the year since it was established, over 200 users have registered, many of whom had no previous involvement with the Blue Obelisk, showing that the Q&A website complements earlier existing channels of communication.

Figure 5 Screenshot of the Blue Obelisk eXchange Question and Answer website.

The rise of self-publishing and print-on-demand services has meant that publishing a book is now as straightforward as uploading to an appropriate website. Unlike the traditional publishing route where books with projected low sales volume would be expensive, websites such as Lulu [90] allow the sale of low-priced books on chemistry software, and books are now available for purchase on Jmol [91], the Chemistry Development Kit [92] and Open Babel [93].

Conclusions

We have shown that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community. However, how best to engage with the wider chemistry community outside of the Blue Obelisk remains an open question. If the Blue Obelisk is truly to make an impact, then an attempt must be made to reach beyond the subscribers to the Blue Obelisk mailing list and blogs of members.

We hope to see this involvement between the Blue Obelisk and the wider community grow in the future. To this end, we encourage the reader to visit the Blue Obelisk website [94], send a message to our mailing list, investigate related projects or read our blogs.

Acknowledgements

NMOB is supported by a Health Research Board Career Development Fellowship (PD/2009/13). The OSRA project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organisations imply endorsement by the US Government.

Author details

1 Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, College Road, Cork, Co. Cork, Ireland.
2 NIH Center for Translational Therapeutics, 9800 Medical Center Drive, Rockville, MD 20878, USA.
3 Division of Molecular Toxicology, Institute of Environmental Medicine, Nobels väg 13, Karolinska Institutet, 171 77 Stockholm, Sweden.
4 Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW, UK.
5 Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24 Uppsala, Sweden.
6 Department of Chemistry, Drexel University, 32nd and Chestnut streets, Philadelphia, PA 19104, USA.
7 Chemical Biology Laboratory, Basic Research Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD 21702, USA.
8 St. Olaf College, 1520 St. Olaf Ave., Northfield, MN 55057, USA.
9 Kitware, Inc., 28 Corporate Drive, Clifton Park, NY 12065, USA.
10 Department of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh, PA 15260, USA.
11 eMolecules Inc., 380 Stevens Ave., Solana Beach, California 92075, USA.
12 Ideaconsult Ltd., 4.A.Kanchev str., Sofia 1000, Bulgaria.
13 Department of Engineering, Computer Science, Physics, and Mathematics, Oral Roberts University, 7777 S. Lewis Ave. Tulsa, OK 74171, USA.
14 Leiden Institute of Chemistry, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands.
15 Department of Chemistry, State University of New York at Buffalo, Buffalo, NY 14260-3000, USA.
16 Université de Strasbourg, IPHC, CNRS, UMR7178, 23 rue du Loess 67037, Strasbourg, France.
17 GGA Software Services LLC, 41 Nab. Chernoi rechki 194342, Saint Petersburg, Russia.
18 Cheminformatics and Metabolism Team, European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
19 Department of Chemistry, University of Washington, Seattle, WA 98195, USA.
20 iChemLabs, 200 Centennial Ave., Suite 200, Piscataway, NJ 08854, USA.

Authors’ contributions

The overall layout of the manuscript grew from discussions between NMOB, RG and ELW. The authorship of the paper is drawn from those people connected with fully Open Data/Standards/Source (OSI-compliant or OKF-compliant) projects associated with the Blue Obelisk. There are a large number of people contributing to these projects and because those projects are published in their own right it is not appropriate to include all their developers by default. We invited a number of ‘project gurus’, who have been active in promoting the Blue Obelisk, to be authors on this paper and most have accepted and contributed.

Competing interests

The authors declare that they have no competing interests.

Received: 1 July 2011 Accepted: 14 October 2011 Published: 14 October 2011
References
1. Matos PD, Alcantara R, Dekker A, Ennis M, Hastings J, Haug K, Spiteri I, Turner S, Steinbeck C: Chemical Entities of Biological Interest: an update. Nucleic Acids Res 2009, 38:D249-D254.
2. Murray-Rust P: The Blue Obelisk. CDK News 2005, 2:43-46.
3. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The Blue Obelisk - Interoperability in Chemical Informatics. J Chem Inf Model 2006, 46:991-998.
4. Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.
5. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43:493-500.
6. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Design 2006, 12:2111-2120.
7. Open Babel. [http://openbabel.org].
8. JOELib. [http://sf.net/projects/joelib].
9. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM: Small Molecule Subgraph Detector (SMSD) Toolkit. J Cheminf 2009, 1:12.
10. Dong X, Gilbert K, Guha R, Heiland R, Kim J, Pierce M, Fox G, Wild D: A Web Service Infrastructure for Chemoinformatics. J Chem Inf Model 2007, 47:1303-1307.
11. RDKit. [http://rdkit.org].
12. Indigo. [http://ggasoftware.com/opensource/indigo].
13. Open Babel - Related Software. [http://openbabel.org/wiki/Related_Projects].
14. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
15. Avogadro: an open-source molecular builder and visualization tool. [http://avogadro.openmolecules.net].
16. Spjuth O, Alvarsson J, Berg A, Eklund M, Kuhn S, Masak C, Torrance G, Wagener J, Willighagen E, Steinbeck C, Wikberg J: Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397.
17. Alvarsson J, Andersson C, Spjuth O, Larsson R, Wikberg J: Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 2011, 12:179.
18. Kalzium - Periodic Table and Chemistry in KDE. [http://edu.kde.org/applications/science/kalzium/].
19. XtalOpt - Evolutionary Crystal Structure Prediction. [http://xtalopt.openmolecules.net].
20. Lonie DC, Zurek E: XtalOpt: An open-source evolutionary algorithm for crystal structure prediction. Comput Phys Commun 2011, 182:372-387.
21. Jeliazkova N, Jeliazkov V: AMBIT RESTful web services: an implementation of the OpenTox application programming interface. J Cheminf 2011, 3:18.
22. Jeliazkova N, Jaworska J, Worth A: Open Source Tools for Read-Across and Category Formation. In In Silico Toxicology : Principles and Applications. Edited by: Cronin M, Madden J. Cambridge UK: RSC Publishing; 2010:408-445.
23. ToxTree. [http://toxtree.sf.net].
24. Patlewicz G, Jeliazkova N, Safford RJ, Worth AP, Aleksiev B: An evaluation of the implementation of the Cramer classification scheme in the Toxtree software. SAR QSAR Environ Res 2008, 19:495-524.
25. O’Boyle NM, Tenderholt AL, Langner KM: cclib: A library for package-independent computational chemistry algorithms. J Comp Chem 2008, 29:839-845.
26. GaussSum. [http://gausssum.sf.net].
27. QMForge. [http://qmforge.sf.net].
28. Dapprich S, Frenking G: Investigation of Donor-Acceptor Interactions: A Charge Decomposition Analysis Using Fragment Molecular Orbitals. J Phys Chem 1995, 99:9352-9362.
29. JUMBO-Converters. [https://bitbucket.org/wwmm/jumbo-converters].
30. Lensfield 2. [https://bitbucket.org/sea36/lensfield2].
31. Chempound. [https://bitbucket.org/chempound/chempound].
32. Stein W, et al: Sage Mathematics Software The Sage Development Team; 2011 [http://www.sagemath.org].
33. Hanson RM: Jmol - a paradigm shift in crystallographic visualization. J Appl Cryst 2010, 43:1250-1260.
34. ChemDoodle Web Components: HTML5 Chemistry. [http://web.chemdoodle.com].
35. Stahl MT: Open-source software: not quite endsville. Drug Discov Today 2005, 10:219-22.
36. DeLano WL: The case for open-source software in drug discovery. Drug Discov Today 2005, 10:213-7.
37. Munos B: Can open-source R&D reinvigorate drug research? Nat Rev Drug Discov 2006, 5:723-9.
38. Geldenhuys WJ, Gaasch KE, Watson M, Allen DD, Van der Schyf CJ: Optimizing the use of open-source software applications in drug discovery. Drug Discov Today 2006, 11:127-32.
39. iBabel. [http://homepage.mac.com/swain/Sites/Macinchem/page65/ibabel3.html].
40. ChemSpotlight. [http://chemspotlight.openmolecules.net].
41. ChemSpider - the free chemical database. [http://www.chemspider.com].
42. iChemLabs and RSC ChemSpider announce partnership. [http://www.chemspider.com/blog/ichemlabs-and-rsc-chemspider-announce-partnership.html].
43. Silicos Open Source Software. [http://silicos.silicos-it.com/download.html].
44. OSRA. [http://cactus.nci.nih.gov/osra/].
45. Lowe DM, Corbett PT, Murray-Rust P, Glen RC: Chemical Name to Structure: OPSIN, an Open Source Solution. J Chem Inf Model 2011, 51:739-753.
46. IUPAC: Nomenclature of Organic Chemistry Pergamon Press, Oxford; 1979.
47. IUPAC: A Guide to IUPAC Nomenclature of Organic Compounds (Recommendations 1993) Blackwell Scientific publications, Oxford; 1993.
48. Rijnbeek M, Steinbeck C: OrChem - An open source chemistry search engine for Oracle(R). J Cheminf 2009, 1:17.
49. Willighagen EL, Brändle MP: Resource description framework technologies in chemistry. J Cheminf 2011, 3:15.
50. Chen B, Dong X, Jiao D, Wang H, Zhu Q, Ding Y, Wild DJ: Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11:255.
51. Rappé AK, Casewit CJ, Colwell KS, Goddard WA III, Skiff WM: UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 1992, 114:10024-10035.
52. PgChem. [http://pgfoundry.org/projects/pgchem/].
53. Mychem. [http://mychem.sf.net].
54. Bingo. [http://ggasoftware.com/opensource/bingo].
55. MolCore. [http://molcore.sf.net].
56. O’Boyle NM, Hutchison GR: Cinfony-combining Open Source cheminformatics toolkits behind a common interface. Chem Cent J 2008, 2:24.
57. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006, 34:W729-W732.
58. KNIME. [http://www.knime.org].
59. Kuhn T, Willighagen E, Zielesny A, Steinbeck C: CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinformatics 2010, 11:159.
60. JNI-InChI. [http://jni-inchi.sf.net/index.html].
61. The OpenSMILES specification. [http://opensmiles.org].
62. Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE: Towards interoperable and reproducible QSAR analyses: Exchange of datasets. J Cheminf 2010, 2:5.
63. The Blue Obelisk Descriptor Ontology. [http://qsar.sourceforge.net/dicts/qsar-descriptors/index.xhtml].
64. Wagener J, Spjuth O, Willighagen EL, Wikberg JES: XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services. BMC Bioinformatics 2009, 10:279.
65. PubChem Standardization Service. [http://pubchem.ncbi.nlm.nih.gov//standardize/standardize.cgi].
66. Panton Principles - Principles for Open Data in Science. [http://pantonprinciples.org].
67. About CC0 - “No Rights Reserved”. [http://creativecommons.org/about/cc0].
68. Open Licenses - Data. [http://www.opendefinition.org/licenses/#Data].
69. Overington J: ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). Interview by Wendy A. Warr. J Comp Aided Mol Des 2009, 23:195-198.
70. FigShare. [http://figshare.com].
71. Dryad. [http://datadryad.org].
72. Reaction Attempts Database. [http://onswebservices.wikispaces.com/reactions].
73. Bradley JC: Useful Chemistry: Reaction Attempts Book Edition 1 and UsefulChem Archive. [http://usefulchem.blogspot.com/2010/04/reaction-attempts-book-edition-1-and.html].
74. Bradley JC: Useful Chemistry: The Synaptic Leap Experiments on Reaction Attempts. [http://usefulchem.blogspot.com/2010/05/synaptic-leap-experiments-on-reaction.html].
75. Bradley JC: Useful Chemistry: Reaction Attempts Explorer. [http://usefulchem.blogspot.com/2010/06/reaction-attempts-explorer.html].
76. Bradley JC: Useful Chemistry: Visualizing Social Networks in Open Notebooks. [http://usefulchem.blogspot.com/2010/12/visualizing-social-networks-in-open.html].
77. Bradley JC, Lang AS, Koch S, Neylon C: Collaboration using Open Notebook Science in Academia. In Collaborative computational technologies for biomedical research. Edited by: Ekins S, Hupcey MA, Williams AJ. Hoboken N.J.: John Wiley 2011:425-452.
78. Bradley JC, Neylon C, Guha R, Williams AJ, Hooker B, Lang ASID, Friesen B, Bohinski T, Bulger D, Federici M, Hale J, Mancinelli J, Mirza KB, Moritz MJ, Rein D, Tchakounte C, Truong HT: Open Notebook Science Challenge: Solubilities of Organic Compounds in Organic Solvents. Nature Precedings 2010 [http://dx.doi.org/10.1038/npre.2010.4243.3].
79. Bradley J, Guha R, Lang A, Lindenbaum P, Neylon C, Williams A, Willighagen E: Beautifying Data in the Real World. In Beautiful Data. 1 edition. Edited by: Segaran T, Hammerbacher J. Sebastopol CA: O’Reilly; 2009:259-278.
80. Open Notebook Solubility Web Services. [http://onswebservices.wikispaces.com/solubility].
81. Bradley J: Useful Chemistry: General Transparent Solubility Prediction using Abraham Descriptors. [http://usefulchem.blogspot.com/2010/07/general-transparent-solubility.html].
82. Abraham MH, Smith RE, Luchtefeld R, Boorem AJ, Luo R, Acree WE Jr: Prediction of solubility of drugs and other compounds in organic solvents. J Pharm Sci 2010, 99:1500-1515.
83. Blue Obelisk Data Repository. [http://bodr.sf.net].
84. NMRShiftDB. [http://www.nmrshiftdb.org].
85. Steinbeck C, Kuhn S: NMRShiftDB - compound identification and structure elucidation support through a free community-build web database. Phytochemistry 2004, 65:2711-2717.
86. Willighagen EL: Chemical Archeology: OSCAR3 to NMRShiftDB.org. [http://chem-bla-ics.blogspot.com/2006/09/chemical-archeology-oscar3-to.html].
87. CrystalEye. [http://wwmm.ch.cam.ac.uk/crystaleye/].
88. Blue Obelisk QA. [http://blueobelisk.shapado.com].
89. Stack Overflow. [http://stackoverflow.com].
90. Lulu. [http://lulu.com].
91. Herráez A: In How to use Jmol to study and present molecular structures. Volume 1. Lulu Enterprises, Morrisville, NC, US; 2007.
92. Willighagen E: Groovy Cheminformatics with the Chemistry Development Kit Lulu Enterprises, Morrisville, NC, US; 2011.
93. Hutchison GR, Morley C, O’Boyle NM, James C, Swain C, De Winter H, Vandermeersch T: Open Babel - Official User Guide Lulu Enterprises, Morrisville, NC, US; 2011.
94. Blue Obelisk web site. [http://blueobelisk.org].
95. Day N, Murray-Rust P, Tyrrell S: CIFXML: a schema and toolkit for managing CIFs in XML. J Appl Cryst 2011, 44:628-634.
96. Hawizy L, Jessop D, Adams N, Murray-Rust P: ChemicalTagger: A tool for Semantic Text-mining in Chemistry. J Cheminf 2011, 3:17.
97. O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR: Confab - Systematic generation of diverse low-energy conformers. J Cheminf 2011, 3:8.
98. Chempedia. [http://chempedia.com].

doi:10.1186/1758-2946-3-37
Cite this article as: O’Boyle et al.: Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of Cheminformatics 2011 3:37.