Introduction To Cheminformatics
Chemoinformatics is an interface science aimed primarily at discovering novel chemical entities that will
ultimately result in the development of novel treatments for unmet medical needs, although these same
methods are also applied in other fields that ultimately design new molecules. The field combines expertise
from, among others, chemistry, biology, physics, biochemistry, statistics, mathematics, and computer science.
In this general review of chemoinformatics the emphasis is placed on describing the general methods that
are routinely applied in molecular discovery and in a context that provides for an easily accessible article for
computer scientists as well as scientists from other numerate disciplines.
Categories and Subject Descriptors: A.1 [Introductory and Survey]; E.1 [Data Structures]: Graphs and networks; G.0 [Mathematics of Computing]: General; H.3.0 [Information Storage and Retrieval]: General; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search; I.5.3 [Pattern Recognition]: Clustering; J.2 [Physical Sciences and Engineering]: Chemistry; J.3 [Life and Medical Sciences]: Health
General Terms: Algorithms, Design, Experimentation, Measurement, Theory
Additional Key Words and Phrases: Chemoinformatics, chemometrics, docking, drug discovery, molecular
modeling, QSAR
ACM Reference Format:
Brown, N. 2009. Chemoinformatics—an introduction for computer scientists. ACM Comput. Surv.
41, 2, Article 8 (February 2009), 38 pages. DOI = 10.1145/1459352.1459353 http://doi.acm.org/10.1145/1459352.1459353
1. INTRODUCTION
Chemistry research has only in recent years had available the technology that allows
chemists to regularly synthesize and test thousands, if not hundreds of thousands,
of novel molecules for new applications, whereas before these technologies existed,
a typical chemist would consider only one to two molecules a week. However, information management and retrieval systems have not kept sufficient pace with these technologies to allow the generated information to be collated and analyzed in a standard and efficient manner, thereby making best use of our knowledge base.
Author’s address: The Institute of Cancer Research, 15 Cotswold Road, Sutton, SM2 5NG, Surrey, U.K.;
email: [email protected].
Indeed, there are many significant challenges still open not only in chemical
databases, but also in the methods to analyze properly the molecular structures to
transform the data into information that can be interpreted. This is one of the most
important aspects of current chemistry research, since it can be predicted that making
available computational tools to analyze molecules will reduce the numbers of experi-
ments that chemists must perform. It has even been speculated that the vast majority
of the discovery process for novel chemical entities (NCEs) will one day be performed
in silico rather than in vitro or in vivo.
An example of the historical interface between chemistry and computer science is provided in the story surrounding the development of fragment screening systems that used fragment codes. The fragment screening systems were
initially developed to increase the speed at which large databases of molecules could
be prefiltered for the presence or absence of a particular chemical substructure, before
proceeding on to a more intensive graph-matching algorithm. These methods were later
adapted to text searching by Michael Lynch—a pioneer in chemical informatics—and
colleagues in Sheffield and later for text compression by Howard Petrie, also in Sheffield.
Lynch discussed this concept with the Israeli scientists Jacob Ziv and Abraham Lempel
in the early 1970s, who went on to generalize the concept and adapt it in their Ziv-Lempel (LZ77 and LZ78) algorithms, which were later extended into the Lempel-Ziv-Welch (LZW) algorithm now used widely in data compression. Therefore, we can see that the cross-fertilization of ideas between two fields that are similar in theory, yet different in application, has led to paradigm shifts that were completely unanticipated [Lynch 2004].
This article is intended as a general introduction, accessible to computer scientists, to the current standard methods applied in the field of chemoinformatics. To this aim, it provides a useful and extensive starting point for computer scientists, which will also be of general interest to those in any numerate discipline
including the field of chemistry itself. The article covers historical aspects of chemistry
and informatics to set the field in context, while also covering many of the current
challenges in chemoinformatics and popular techniques and methods that are routinely
applied in academia and industry.
Many of the activities performed in chemoinformatics can be seen as types of infor-
mation retrieval in a particular domain [Willett 2000]. In document-based information
retrieval we can apply transforms to describe a document of interest that then permits
an objective determination of similarity to additional documents to locate those doc-
uments that are likely to be of most interest. In chemoinformatics this is similar to
searching for new molecules of interest when a single molecule has been identified as
being relevant.
Although perhaps less known than the sister field of bioinformatics [Cohen 2004],
chemoinformatics has a considerable history both in research and in years. Whereas
bioinformatics focuses on sequence data, chemoinformatics focuses on structure infor-
mation of small molecules. A great deal of chemoinformatics research has been con-
ducted in a relatively small number of world-class academic laboratories. However,
due to the applied nature of the field, a considerable amount has also been achieved
in large chemical companies, including pharmaceuticals, petrochemicals, fine chemi-
cals, and the food sciences. Although many areas of research within chemoinformatics
will be touched upon in this introduction for computer scientists, it is not intended
as a full and comparative critique of existing technologies, but as a brief overview of
the main endeavors that a typical scientist conducting research in this field would
know instinctively. References have been provided to review articles on particular top-
ics to allow the interested reader ready access to articles that detail the most salient
points regarding each of the methods and applications discussed herein. Further details
on chemoinformatics are available in two excellent textbooks [Leach and Gillet 2003;
Gasteiger and Engel 2003], and in more general surveys of the field [Bajorath 2004;
Gasteiger 2003; Oprea 2005a].
The registry maintained by the Chemical Abstracts Service (CAS) currently contains fewer than 33 million molecules (as of September 19, 2007). Therefore, the theoretical druglike chemistry space contains anything from 3 × 10^5 to 10^173 times more druglike molecules than we have currently registered in CAS. Moreover, CAS also contains molecules that are not necessarily druglike.
To be able to consider this vast druglike chemistry space, it is necessary to deploy
computer systems using many diverse methods which allow a rational exploration of the
space without evaluating every molecule while also capitalizing on the extant molecules
we have stored in our compound archives.
In drug discovery the search for NCEs is a filtering and transformation process. Initially, we could consider the whole potential druglike space; however, this raises issues of synthetic accessibility—that is, whether we can actually make these molecules in the laboratory—that have not yet been solved for practical application. It is more usual to consider the lists of molecules available from compound vendors, together with molecules that are natural products. However, these lists still run into the many millions.
At this point we can introduce chemoinformatics techniques to assist in filtering the
space of available molecules to something more manageable while also maximizing
our chances of covering the molecules with the most potential to enter the clinic and
maintaining some degree of structural diversity to avoid prospective redundancies or
premature convergence. Each of these objectives must be balanced, which is no easy
task. However, once this stage has been finalized to the satisfaction of the objectives under consideration, the resulting collection typically contains only a few million molecules. These com-
pound archives require vast and complex automation systems to maintain them and
facilitate the usage of the compounds contained therein.
From this library, screening sets are selected in which each molecule in the set is tested against a biological target of interest using initial high-throughput screening (HTS). A hit in this context is a molecule that is indicated to have bound to our protein of interest. Typically, these hitlists are filtered using chemoinformatics methods to select only those molecules in which we are most interested; this process is often referred to as HTS triaging. From this comes a smaller triaged hitlist that can be handled in a higher-quality validation screen, which often tests molecules in replicate to avoid potential technological artifacts. A summary of this filtering process in drug discovery
is given in Figure 2.
Fig. 2. The filtering processes in drug discovery that are applied to our molecular collections and ultimately
result in a drug that can be brought to the clinic. The hits are those molecules returned from an initial
screening program, while the leads or lead candidates are those that are followed up in medicinal chemistry.
The first step in HTS is to decide which particular compounds are to be screened.
Although druglike chemistry space is vast, only a relatively tiny subset of this space is
available. However, even this space runs into the many millions of compounds. There-
fore, it is important to determine which molecules should be included in our screening
libraries to be tested with HTS.
Two extremes are evident in many HTS programs: diverse and focused screening
libraries. The diverse libraries are often used for exploratory research, while focused
sets are used to exploit knowledge from related targets in maximizing our hits in new
targets. Corporate collections, historically, tend to be skewed to the particular areas of
endeavor for which each of the companies are known. Recently, however, efforts have
been made to increase the coverage of the available chemistry space to maximize the
possibility of covering our biological activity space, typically using diversity selection
methods, as discussed in Section 6 [Schuffenhauer et al. 2006].
Once HTS has been completed on a particular library, it is frequently necessary to
further rationalize the set of hits for reasons of pragmatism or the particular HTS tech-
nology applied, since HTS can return many hits dependent on the particular assay. Here again similar methods can be applied as in the compound acquisition phase, although
undesirable compounds will have already been filtered by this stage. Many alternative
approaches exist in the literature for this HTS triaging phase, and typically the aim
is once again to maximize our set of hits using extant knowledge, while also exploring
new and potentially interesting regions of our explored chemistry space. This is often
achieved using in silico tools that permit case-by-case alterations to our workflow.
. . . the application of informatics to solve chemical problems . . . [and] chemoinformatics makes the point
that you’re using one scientific discipline to understand another scientific discipline.
Johann Gasteiger, 2002 (cited in Russo [2002], page 5)
Fig. 3. Two examples of the molecular graphic notations by Alexander Crum Brown; note that the hydrogen
atoms are explicit in these diagrams.
atomistic theory and this was where the most substantial interest in the mathemati-
cal abstractions from Euler became apparent. Numerous chemists worked to formalize
representation systems of molecules. Arthur Cayley (1821–1895) used graph-like struc-
tures in his studies on enumerating the isomers of alkanes. However, two scientists in
particular provided the field of graph theory with the beginnings of a formal name:
Alexander Crum Brown (1838–1922) and James Joseph Sylvester (1814–1897). Crum
Brown developed the constitutional formulae in 1864 representing atoms as nodes and
the bonds as edges [Crum Brown 1864]. Crum Brown was at pains to state that these
graphs represented abstractions of molecules in that they were not intended to be
accurate representations of real molecules, but merely to illustrate the relationships
between the atoms (Figure 3). Sylvester also developed his very similar representation
(Figure 4) around the same time as Crum Brown. Crum Brown referred to his struc-
tures as molecular graphic notations while Sylvester called his chemicographs. While
it is difficult to say with any certainty which scientist applied the graph term first, it is
incontrovertible that both Crum Brown and Sylvester, both chemists, at least assisted
in the development of a new name for the field: graph theory.
The field of graph theory continued in its own right as a field of mathematics resulting
in the very active field we know today. However, chemistry had also only just begun
its foray into graph theory. Almost a century after the research from Crum Brown and
Cayley, the broader field of mathematical chemistry emerged in its own right, with
this field applying mathematics in an effort to understand chemical systems and make
predictions of molecular structure.
\frac{2 \cdot |E(G)|}{|V(G)| \cdot (|V(G)| - 1)}. \quad (1)
Fig. 5. The hydrogen-depleted molecular graphs of (a) caffeine, (b) aspirin, and (c) D-lysergic acid diethy-
lamide.
The molecular graph is a type of graph that is undirected and where the nodes are
colored and edges are weighted. The individual nodes are colored according to the particular atom type they represent: carbon (C), oxygen (O), nitrogen (N), chlorine (Cl), etc., while the edges are assigned weights according to the bond order: single, double, triple, or aromatic. Aromaticity is an especially important concept in chemistry. An
aromatic system, such as the benzene ring, involves a delocalized electron system where
the bonding system can be described as somewhere between single and double bonds,
as in molecular orbital (MO) theory [Bauerschmidt and Gasteiger 1997]. In the case of
the benzene ring—a six-member carbon ring—six π electrons are delocalized over the
entire ring. A common approach to representing an aromatic system in a computer is to
use resonant structures, where the molecule adopts one of two bonding configurations
using alternating single and double bonds. However, this is an inadequate model for
the representation of aromaticity and therefore the use of an aromatic bond type is also
used. Molecular graphs also tend to be hydrogen depleted, that is, the hydrogens are
implicitly represented in the graph since they are assumed to fill the unused valences
of each of the atoms in the molecule. Each atom is ascribed a particular valence that is deemed at least indicative of the typical valence of that element: carbon has a valence of 4, oxygen has 2, and hydrogen has 1. The molecular graph representa-
tions of (a) caffeine, (b) aspirin, and (c) D-lysergic acid diethylamide are provided in
Figure 5.
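To make the data structure concrete, the sketch below traverses the hydrogen-depleted molecular graph of caffeine, with nodes colored by element and edges weighted by bond order. It assumes the open-source RDKit toolkit, which is not discussed in this article; any chemistry toolkit would serve equally well.

```python
from rdkit import Chem

# Caffeine in its Kekule SMILES form (see Section 3 for line notations).
mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")

# Nodes, "colored" by atom type; hydrogens are implicit (hydrogen depleted).
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(), "implicit H:", atom.GetTotalNumHs())

# Edges, "weighted" by bond order: SINGLE, DOUBLE, TRIPLE, or AROMATIC.
for bond in mol.GetBonds():
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
```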
3. MOLECULAR REPRESENTATIONS
Molecules are complicated real-world objects; however, their representation in the com-
puter is subject to a wide range of pragmatic decisions based largely on the domain of
interest to which the data structures are to be applied, but also decisions that were
made according to the availability of computational resources. There is a hierarchy of
molecular representations with each point in the hierarchy having its own domain of
applicability.
In chemoinformatics, the most popular representation is the two-dimensional (2D)
chemical structure (topology) with no explicit geometric information. However, even
these objects necessitate the application of highly complex computational algorithms
to perform comparisons and transforms.
As caffeine is a molecule of significant importance to computer scientists, various representations of it are provided in Figure 6. The representations range from the general
name itself (caffeine), through an arbitrary number assigned to the molecule (CAS
Registry Number), a typical database record entry in Simplified Molecular Input Line
Fig. 6. Some of the many ways in which molecules can be represented from simple names, empirical formu-
lae, and line notations, through to computational models of their structure.
Entry Specification (SMILES) and Structure Data Format (SDF), and on to graph-based
and geometric-based models of the molecule itself.
However, between the explicit and implicit hydrogen models of molecules, there ex-
ists the polar hydrogen model that includes those hydrogens that are likely to be in
long-range hydrogen bonds, and is more frequently used in molecular mechanics (MM)
applications.
Fig. 7. The connection table representation of caffeine, with the most pertinent information regarding the
structure highlighted.
1 www.symyx.com.
2 www.tripos.com.
Fig. 8. Some examples of simple molecules, their topologies, and the corresponding SMILES strings that
explain connectivity, branching, and ring systems. In the fifth and seventh instances, alternative SMILES
representations are given for representing aromatic systems.
Fig. 9. Example of the InChI code for caffeine with the chemical formula, atom connection, and hydrogen
sublayers of the main layer.
(4) Cycles are represented by a common number following the atoms that connect to
form the cycle.
(5) In aromatic systems the atomic characters that are part of the aromatic systems
are written in lower-case.
(6) Last, since single bonds are implicit, it is necessary for a disconnection to be encoded
explicitly with the full-stop or period (“.”) character.
One issue that quickly arose with the SMILES notation was the lack of a unique representation, since a molecule can be encoded beginning at any atom, in addition to further ambiguity with regard to which path to take in encoding the molecule. From Figure 8,
ethanol could be encoded as the following four SMILES strings, each of which is valid:
CCO, OCC, C(C)O, and C(O)C. This limited the application of SMILES as a unique
identifier in database systems. Therefore, a method of encoding a molecule was quickly
developed that provided an invariant SMILES representation. The Morgan algorithm
[Morgan 1965] was proposed in the 1960s to provide a canonical ordering of atoms in
a molecule for just such an application: indexing in database systems. The Morgan
algorithm proceeds by assigning values to each of the atoms of a molecule iteratively,
based on their extended connectivities; initially assigned values are based on the node
degree of each of the atoms, excluding hydrogens. The node coding and partitioning
approaches in the Morgan algorithm are analogous to the methods applied in most
graph and subgraph isomorphism algorithms.
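As a minimal sketch of both ideas (assuming the open-source RDKit toolkit, whose canonicalization descends from work of this kind, and a toy adjacency list for the Morgan iteration), the four ethanol encodings above collapse to a single canonical string:

```python
from rdkit import Chem

# All four valid SMILES for ethanol map to the same canonical string.
for smiles in ("CCO", "OCC", "C(C)O", "C(O)C"):
    print(smiles, "->", Chem.MolToSmiles(Chem.MolFromSmiles(smiles)))

# A toy Morgan-style iteration: labels start as node degrees and are
# iteratively replaced by sums over neighbors until the number of distinct
# labels stops increasing (real implementations also break ties by atom type).
adj = {0: [1], 1: [0, 2], 2: [1]}                 # ethanol heavy atoms C-C-O
labels = {v: len(nbrs) for v, nbrs in adj.items()}
while True:
    new = {v: sum(labels[u] for u in adj[v]) for v in adj}
    if len(set(new.values())) <= len(set(labels.values())):
        break
    labels = new
print(labels)   # extended-connectivity values used to order the atoms
```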
Recent developments in line notations are the InChI (International Chemical Iden-
tifier) codes, supported by the International Union of Pure and Applied Chemistry
(IUPAC), which can uniquely describe a molecule in a very compact form (Figure 9), but are not intended for readability [Adam 2002; Coles et al. 2005]. The InChI code of a molecule encodes that molecule in a series of six layers: main, charge, stereochemical, isotopic, fixed-H, and reconnected. Each of these layers can be further split into sublayers, but the sublayers themselves cannot be split further. The main layer can be split
further into chemical formula, atom connections, and hydrogen atom sublayers; the
main layer and the chemical formula sublayer are the only two mandatory components
of every InChI code. The InChI code system was developed to be an open identifier
for chemical structures primarily for printed and electronic publications to allow the
data to be captured for subsequent indexing and searching. The sublayers are used to
discriminate unique molecules that are otherwise not distinct when using alternative
representation methods such as can be the case with SMILES.
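As an illustration (a sketch assuming an RDKit build with InChI support; the toolkit itself is not part of this article), the layered identifier can be generated directly from a structure:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")  # caffeine
inchi = Chem.MolToInchi(mol)
print(inchi)
# The printed string begins "InChI=1S/C8H10N4O2/..." with the chemical
# formula, connection (/c), and hydrogen (/h) sublayers of the main layer.
```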
molecules are 3D objects, it would seem that one of the standard 3D graph-layout
algorithms would suffice; however these layouts do not necessarily take into account
chemical knowledge. Therefore, a number of programs have been developed that can
generate a single conformer, or multiple possible conformers (since molecules take dif-
ferent conformers depending on their environment), that represents a minimum-energy
conformation—a conformation in which the geometric arrangement of atoms leads to
a global minimum in the internal energy of the system. Two of the most popular pro-
grams for this purpose are Concord [Pearlman 1987], and CORINA (COoRdINAtes)
[Gasteiger et al. 1990]. Both of these programs operate in a similar way through the
application of chemical knowledge.
The CORINA program, for example, has a number of rules for bond angles and lengths
based on the atom type involved and its particular hybridization state. Rings are consid-
ered individually, with particular conformations being selected from ring conformation
libraries that have been generated from mining the information in crystallographic
databases. Pseudo-force-field calculations and the removal of overlapping atoms are
then performed to clean the resultant conformers.
The resultant low-energy conformations returned by CORINA have been demon-
strated to have low root-mean-square deviation (RMSD) values when compared with
X-ray crystal structures from the Cambridge Structural Database (CSD) from the Cam-
bridge Crystallographic Data Centre (CCDC).
Many of the molecular geometry generator programs also permit the generation of
multiple low-energy conformations of a single molecule, allowing the user to select the
preferred conformation for their particular application, or in some way combining all
the conformations and then applying this combined description. However, although
intuitively it is expected that 3D will be superior to 2D or topological representations,
this has not been shown to be the case in many instances.
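Concord and CORINA are proprietary programs, but the idea can be sketched with the open-source RDKit (an assumption, not a tool used in this article), which likewise embeds knowledge-based conformers and relaxes them with a force field:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))

# Generate several trial conformers, then relax each with the MMFF force field.
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)
results = AllChem.MMFFOptimizeMoleculeConfs(mol)   # list of (flag, energy)

best = min(range(len(results)), key=lambda i: results[i][1])
print("lowest-energy conformer:", conf_ids[best], "energy:", results[best][1])
```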
have been demonstrated to be effective for the predictive modeling of congeneric series
of molecules based on molecular alignments—whether these are aligned by hand or by
an automated process. A congeneric series of molecules is one in which each molecule
contains a significant amount of structural similarity to permit meaningful alignments.
4. MOLECULAR DESCRIPTORS
The generation of informative data from molecular structures is of high importance in
chemoinformatics since it is often the precursor to permitting statistical analyses of
the molecules. However, there are many possible approaches to calculating informative
molecular descriptors [Karelson 2000; Todeschini and Consonni 2000].
When discussing molecular descriptors, one is reminded of the parable of the blind
men and the elephant by John Godfrey Saxe. In this poem, six blind men endeavor to
describe an elephant and variously determine that the elephant reminds them of a rope
(tail), tree (leg), snake (trunk), spear (tusk), fan (ear), and wall (body). This emphasizes
the local context information to which the blind men are privy. Essentially, in only
considering selected aspects of the elephant, the overall description of the elephant
is not forthcoming. In this way, molecules are similar to elephants since they contain
many features that in themselves are not particularly informative, but considered in
combination provide a rich characterization of the object under study.
4.2.1. Wiener Index. The Wiener index W is calculated as the sum of the shortest-path distances, in bonds, between all pairs of nodes in a molecular graph, G. Since the shortest path is used to determine the number of edges between the nodes, the Floyd or Dijkstra algorithms for calculating shortest paths are typically applied. The W index is calculated
thus:

W = \sum_{i=2}^{N} \sum_{j=1}^{i} D_{ij}, \quad (2)

where N is the size of the molecule in atoms and D_{ij} is the shortest-path distance between atoms i and j.
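A plain-Python sketch (the adjacency-list input and the n-butane example are illustrative assumptions) computes W with the Floyd-Warshall all-pairs shortest-path algorithm:

```python
def wiener_index(adj):
    """Wiener index from a hydrogen-depleted adjacency list."""
    n, INF = len(adj), float("inf")
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for i, nbrs in adj.items():
        for j in nbrs:
            d[i][j] = 1                              # each bond is one edge
    for k in range(n):                               # Floyd-Warshall update
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return sum(d[i][j] for i in range(n) for j in range(i))  # pairs j < i

# n-butane as a path graph C-C-C-C gives W = 10.
print(wiener_index({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))
```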
4.2.2. Randić Index. This topological index was developed by Randić and characterizes the branching of a given molecule. It is also referred to as the connectivity or branching index. The index is calculated by summing, over all B edges in the graph, the reciprocal square root of the product of the degrees, δ, of the two nodes incident to each edge:

R = \sum_{b=1}^{B} \frac{1}{\sqrt{(\delta_i \cdot \delta_j)_b}}. \quad (3)
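A corresponding plain-Python sketch (the adjacency list is an illustrative assumption):

```python
from math import sqrt

def randic_index(adj):
    """Sum over edges of the reciprocal square root of the degree product."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    edges = {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
    return sum(1.0 / sqrt(deg[i] * deg[j]) for i, j in edges)

# Isobutane: a central carbon bonded to three others; R = 3/sqrt(3) ~ 1.732.
print(randic_index({0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}))
```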
3 http://www.talete.it.
is given a contribution factor with the ClogP calculated as the sum of the products of
the number of each particular atom and its contribution [Ghose and Crippen 1986].
Other common physicochemical descriptors that can be calculated in silico to varying
degrees of accuracy are pKa (acid dissociation constant), logD (octanol-water distribu-
tion, as opposed to partition, coefficient), logS (aqueous solubility, or logW/logSw (water solubility)), and the PSA (polar surface area). A review of these physicochemical prop-
erties has been given by Raevsky [2004].
Fig. 10. An example of the encoding of a simple molecule as a structure-key fingerprint using a defined
substructure or fragment dictionary. A defined fragment is assigned a single bit position on the string to
which it, and no other fragment, is mapped.
being used to represent the molecule. Alternatively, the output from the CRC can be
used as a seed for a random number generator (RNG) and a number of indices (typically
4 or 5) being taken from the RNG, again using modulo arithmetic, to be mapped to the
fingerprint being generated. The rationale for the use of the RNG is to reduce the effect
of different molecular paths colliding at the same index in the fingerprint. Since each
path is now represented by four or five indices, the probability of another molecular
path exhibiting precisely the same bit pattern is vastly reduced. A schematic example
of the generation of a hash-key fingerprint is provided in Figure 11.
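A minimal sketch of this scheme follows; CRC32 and Python's random module stand in for the CRC and RNG described above, and the path strings are illustrative assumptions:

```python
import random
import zlib

def hash_fingerprint(paths, n_bits=1024, bits_per_path=4):
    """Fold enumerated atom paths into a fixed-length binary fingerprint."""
    fp = [0] * n_bits
    for path in paths:
        rng = random.Random(zlib.crc32(path.encode()))  # CRC seeds the RNG
        for _ in range(bits_per_path):                  # 4-5 indices per path
            fp[rng.randrange(n_bits)] = 1
    return fp

fp = hash_fingerprint(["C-N", "C-N-C", "N-C=O"])
print(sum(fp), "bits set")   # distinct paths rarely share all their bits
```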
A recent advance in hash-key fingerprints has been provided by SciTegic in their
PipelinePilot workflow software [Brown et al. 2005]. In this configuration, circular atom
environments are enumerated, rather than atom paths, with these being canonicalized
using an algorithm such as that from Morgan as described previously, providing a
unique representation that acts as a key. Although circular substructures, or atom
environments, were first used in the 1970s for screening systems, this recent innovation
has provided a new type of molecular fingerprint descriptor that has been demonstrated
to be of great application in similarity searching [Hert et al. 2004; Rogers et al. 2005].
An example of the enumeration of atom environments for a single atom is provided in
Figure 12 for bond radii from the root atom of 0, 1, 2, and 3, respectively.
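As a sketch (assuming the open-source RDKit, whose Morgan fingerprints implement the same circular-environment idea), such a descriptor can be generated in a few lines:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")   # caffeine

# Circular atom environments up to a bond radius of 2, folded to 1024 bits.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
print(fp.GetNumOnBits(), "bits set")
```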
Although apparently quite simplistic in design, molecular hash-key fingerprint algo-
rithms have been demonstrated to be highly effective in encapsulating molecular infor-
mation, which is evident in the widespread application to many challenges in chemoin-
formatics. The Fingerprinting Algorithm (Fingal) descriptor [Brown et al. 2005] is one
such recent example of a hash-key fingerprint descriptor and is based substantially on
the fingerprints from Daylight Chemical Information Systems, Inc.,4 while also being
extended to encapsulate additional structural information, such as molecular geome-
tries in an alignment-free way. Fingal is additionally capable of encoding the Euclidean
distance between the current atom in the path and all of the previous atoms in the path
thus far.
Hash-key fingerprints are very rapid to calculate and encapsulate a great deal of the
information necessary for them to be effective in many applications in chemoinformat-
ics. However, they do have significant limitations and are not universally applicable.
The resultant descriptors are highly redundant and, more significantly, they are not
readily interpretable. However, the latter issue can be overcome to a large extent with a
4 http://www.daylight.com.
Fig. 11. A partial example of the encoding of caffeine as a hash-key fingerprint. The original structure
caffeine (a) with the root nitrogen atom for this particular path enumeration highlighted in bold. The enu-
meration (b) of paths from the root atom up to three bonds away represented as a tree. Each of the paths
in all levels of the enumerated path tree is then converted into one or more integer values using a hashing
algorithm and a pseudorandom number generator to give n bit positions (3 in this case) that are set for each
of the paths, here shown only for level 3, of the tree in (c); an instance of a “bit collision” is highlighted.
certain memory overhead required to store the molecular fragments, subject to a degree of fuzziness brought on by the hashing approach.
Fig. 12. An example of the enumeration of atom environments (augmented atom or circular substructure)
of caffeine at bond radii (a) 0, (b) 1, (c) 2, and (d) 3, respectively, highlighted in black, with the remaining
portion of the molecule in gray.
is difficult to map back to a single structure, although the particular ligands (specific
molecules that bind to a protein to evoke a response) that are interesting would most
likely still be apparent. The use of an information-rich descriptor such as hash-key
fingerprints that do their best to encode as much information as provided will tend to
map back only to a single molecular entity; the lack of frequency of occurrence infor-
mation in binary fingerprints would limit this ability, but it is still generally possible to
map back to the region of chemistry space. Alternatively, the use of integer fingerprints
permits a more accurate mapping back from the descriptor to the structure space and
has been used for this purpose in de novo design [Brown et al. 2004]; see Section 5 for
more information on de novo design.
There are two distinct types of pharmacophores: structure based and ligand based.
The structure-based pharmacophores use information from a protein binding site or a
bound conformation or crystallized structure of a ligand to derive a pharmacophore
model, whereas ligand-based pharmacophores consider only a given topology of a
molecule of interest and attempt to derive the salient information from that using var-
ious molecular characterization methods including non-target-specific conformations.
Fig. 13. Four molecules: (a) (–)-nicotine, (b) (–)-cytisine, (c) (–)-ferruginine methiodide, and (d) (–)-muscarone
used to derive the nicotinic pharmacophore by distance geometry and (e) the pharmacophore model obtained.
The pharmacophoric points of each of the molecules are given by the letters A, B, and C, respectively, and
equate to the features in the pharmacophore model in (e). Adapted from Leach [2001].
a ligand is sought to bind), this information can be used to develop a spatial arrange-
ment of desirable molecular properties to describe the desired molecule. This pharma-
cophore model can then be used to search against a 3D library of molecules with the
aim of retrieving molecules that will exhibit a similar pharmacophoric arrangement of
features and therefore be more likely to also exhibit the same biological activity, which
is concomitant with the similar-property principle.
Typically, 3-point pharmacophore descriptions are used—although 4-point pharma-
cophores have been published—and can be designed by hand using knowledge, or de-
rived automatically from given information. The distances between each of the feature
points are specified in Ångströms (Å, or 10^−10 m) but also permitted to be within some
tolerance to permit a fuzzy matching with given molecules. An example of a pharma-
cophore for the nicotinic receptor is provided in Figure 13.
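A toy sketch of such fuzzy matching follows; the model distances, tolerance, and coordinates below are hypothetical, not those of the nicotinic pharmacophore:

```python
from itertools import permutations

import numpy as np

MODEL = np.array([4.8, 4.0, 1.2])   # hypothetical A-B, A-C, B-C distances (Å)
TOL = 0.5                           # hypothetical matching tolerance

def matches(points):
    """True if three feature points fit the model under some assignment."""
    for p, q, r in permutations(points):    # features interchangeable here
        d = np.array([np.linalg.norm(p - q), np.linalg.norm(p - r),
                      np.linalg.norm(q - r)])
        if np.all(np.abs(d - MODEL) <= TOL):
            return True
    return False

example = np.array([[0.0, 0.0, 0.0], [4.8, 0.0, 0.0], [3.92, 0.81, 0.0]])
print(matches(example))   # True: all three distances fall within tolerance
```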
This approach is related to the concept of bioisosterism, where molecules or frag-
ments are said to be bioisosteric if they present a biologically important and similar
arrangement of features to the target. Bioisosterism is of great interest since replacing
substituents of molecules with bioisosteres can assist in improving potency against
the target, but also assist in designing out off-target effects that can cause undesired
responses such as adverse cardiac events common to potential drugs with hERG (human Ether-a-go-go Related Gene) liabilities [Ertl 2007].

Fig. 14. Example of a single Similog key determination. Adapted from Schuffenhauer et al. [2003].
Fig. 15. Illustration of known adenosine A2A-antagonists, important for treatment of Parkinson's disease: from the two natural products (a) adenosine (an agonist) and (b) caffeine (a subtype-unselective antagonist) to the designed ligand (c), an instance of an A2A-antagonist. Adapted from Böhm et al. [2004].
defines, and not necessarily a scaffold in the literal sense. The definition of a scaffold
is important since it is possible to determine whether an instance of scaffold hopping
(leapfrogging, lead-hopping, chemotype switching, and scaffold searching) has occurred:
that is, a complete change in the region of chemistry space being investigated, yet one
that elicits a similar biological response [Böhm et al. 2004; Brown and Jacoby 2006]. An
example of scaffold hopping is given in Figure 15 for A2A antagonists. It is important to
note that, although the molecule in Figure 15(c) wholly contains the caffeine scaffold,
this is still classified as a scaffold hop since the core structure has been modified with
an additional six-member ring system (benzene).
Experts with different backgrounds and knowledge will tend to define a scaffold dif-
ferently depending on their particular domains of interest. For instance, a synthetic
chemist may define a molecular scaffold based on the diversity not of the core struc-
ture, but on the relative diversity of the synthetic routes to the molecules themselves,
whereas, patent lawyers would typically consider only the general similarity of the in-
ternal structure of the molecule to determine whether or not that particular region of
scaffold chemistry space has prior art in terms of its Markush structure representa-
tion (see below) for the particular application domain of interest. Chemoinformaticians,
however, will always favor an objective and invariant algorithm that will provide a so-
lution rapidly and without ambiguity. In this case, a scaffold definition is provided by
a graph transformation algorithm that, given a molecular topological graph, ensures
that the scaffold can be realized deterministically. However, there are also significant
limitations in the scaffold determination algorithm that maintains current favor in
chemoinformatics.
4.6.1. Early Definitions of Scaffolds. One of the earliest scaffold descriptions was that
introduced by Eugene A. Markush of the Pharma-Chemical Corporation in a landmark
patent that was granted on August 26, 1924 [Markush 1924]—although this was not
the first patent to include such a generic definition. Markush’s patent covered an entire
family of pyrazolone dye molecules:
The process for the manufacture of dyes which comprises coupling with a halogen-substituted pyrazolone,
a diazotized unsulphonated material selected from the group consisting of aniline, homologues of aniline
and halogen substitution products of aniline [Markush 1924, page 2].
In making this claim, Markush was able to claim rights not just to an individual
compound of interest, but also a large number of molecules of only potential inter-
est in the chemistry space surrounding the actual molecule synthesized at the center
of the claim. Markush structures are now used extensively to protect chemical series of
Fig. 16. The (b) molecular, (c) graph, and (d) reduced scaffold framework representations of the caffeine molecule (a), respectively.
interest in patents in any industry that develops NCEs. The Markush generic structure
is concerned more with intellectual property rights than with a scientific basis, and
it is therefore not absolutely necessary that all of the molecules covered by a Markush
representation can be synthesized.
The previous scaffold definition was designed specifically for intellectual property
applications, but the scientific definition is also important to describe classes of related
molecules accurately and invariantly. However, the definition of a scaffold is deceptively
trivial to state, but incredibly difficult—if at all possible—to reduce to a set of generic
rules that do not consider how the definition will be applied. For an early reference
for an acceptable definition of a scaffold, as we tend to mean it today, we can go to
the American Chemical Society (ACS) literature database and an article by Reich and
Cram [1969] that describes it thus:
The ring system is highly rigid, and can act as a scaffold for placing functional groups in set geometric
relationships to one another for systematic studies of transannular and multiple functional group effects
on physical and chemical properties (page 3527).
Although the definition is explanatory, it does not provide the practitioner with a
rigorous and invariant description which would allow the determination of the scaffold
component of any given molecule. Typically, the scaffold definitions given by Bemis and
Murcko [1996] are now used widely in chemoinformatics to determine the “scaffold”
of a molecule in an approach that is invariant and unbiased. These abstracted graph
descriptions can then be applied in classification problems to enable the practitioner to
group molecules by “scaffold” as another approach to diversity selection (discussed in
more detail in Section 6.4).
From a single molecule, it is possible to generate the Bemis and Murcko [1996] scaf-
fold or molecular framework, as well as the graph framework, as required. The former
prunes side chains of the molecule, but maintains the original atom typing and bond or-
ders used. The latter takes the same molecular framework and then proceeds to further
abstract the atoms and bonds to uncolored and unweighted nodes and edges, respec-
tively, thus giving an indication of the general framework of each molecule considered.
The graph frameworks can be further abstracted by representing the ring systems as
nodes of the graph. An example of a molecule, caffeine (Figure 16(a)) with its molecular,
graph, and reduced scaffold frameworks is provided in Figures 16(b), 16(c), and 16(d),
respectively.
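These definitions are directly computable; the following is a sketch assuming the open-source RDKit, which ships an implementation of the Bemis and Murcko procedure:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")       # aspirin

core = MurckoScaffold.GetScaffoldForMol(mol)            # molecular framework
print(Chem.MolToSmiles(core))                           # c1ccccc1 (benzene)

generic = MurckoScaffold.MakeScaffoldGeneric(core)      # graph framework
print(Chem.MolToSmiles(generic))                        # C1CCCCC1
```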
Table I. The similarity and distance coefficients that are used most frequently in chemoinformatics. For the dichotomous variants, the variables are defined as a = number of bits set in the first binary vector, b = number of bits set in the second binary vector, and c = the number of bits set in common between both binary vectors. All sums run over the K elements of the continuous vectors x_i and x_j.

Name                 | Dichotomous           | Continuous
Cosine               | c / \sqrt{a \cdot b}  | \sum_k x_{ik} x_{jk} / \sqrt{\sum_k x_{ik}^2 \cdot \sum_k x_{jk}^2}
Dice                 | 2c / (a + b)          | 2 \sum_k x_{ik} x_{jk} / (\sum_k x_{ik}^2 + \sum_k x_{jk}^2)
Euclidean            | \sqrt{a + b - 2c}     | \sqrt{\sum_k (x_{ik} - x_{jk})^2}
Hamming              | a + b - 2c            | \sum_k |x_{ik} - x_{jk}|
Tanimoto (Jaccard)   | c / (a + b - c)       | \sum_k x_{ik} x_{jk} / (\sum_k x_{ik}^2 + \sum_k x_{jk}^2 - \sum_k x_{ik} x_{jk})
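A plain-Python sketch of the dichotomous forms in Table I (the five-bit vectors are illustrative assumptions):

```python
def coefficients(fp1, fp2):
    """Dichotomous similarity/distance coefficients for binary fingerprints."""
    a = sum(fp1)                              # bits set in the first vector
    b = sum(fp2)                              # bits set in the second vector
    c = sum(x & y for x, y in zip(fp1, fp2))  # bits set in common
    return {
        "tanimoto": c / (a + b - c),
        "dice": 2 * c / (a + b),
        "cosine": c / (a * b) ** 0.5,
        "hamming": a + b - 2 * c,             # a distance, not a similarity
    }

print(coefficients([1, 1, 0, 1, 0], [0, 1, 0, 1, 1]))
```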
6.2.1. Stirling Numbers of the Second Kind. The space of possible cluster assignments
for a given set of N objects into k unlabeled and disjoint cells or clusters is truly
astronomical for all but the smallest of values of N and k. These values are referred
to as Stirling numbers of the second kind, after the Scottish mathematician James
Stirling [Goldberg et al. 1972]. The calculation of the Stirling number of the second
kind is through the recurrence relation

S(N, k) = k \cdot S(N - 1, k) + S(N - 1, k - 1),

where 1 ≤ k < N, given that the following initial conditions, or base cases, are met:

S(N, N) = 1 and S(N, 1) = 1.
The Stirling number of the first kind differs from the second kind in that the clusters or cells are labeled and therefore all permutations of cluster assignments are considered as distinct. The Bell number is the summation of the Stirling numbers of the second kind over all values of k from 1 to N and therefore provides the total number of possible clustering partitions available.
As an indicator, according to the Stirling number of the second kind, to partition
1006 objects into 196 nonempty cells, there are approximately 6.294 × 10^1939 possible
unique partitioning schemes. Therefore, it is readily apparent that the space of potential
partitioning schemes is vast and it would therefore be very easy to arrive at a grouping
that is even slightly incorrect, whatever incorrect means in this context.
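A memoized sketch of the recurrence in plain Python (suitable for small arguments only; the recursion is illustrative, not a route to numbers of the size quoted above):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via the recurrence above."""
    if k == 1 or k == n:
        return 1                  # base cases S(N, 1) = S(N, N) = 1
    if k < 1 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n):
    """Bell number: total partitions over all cluster counts k."""
    return sum(stirling2(n, k) for k in range(1, n + 1))

print(stirling2(10, 3))   # 9330 ways to partition 10 objects into 3 cells
print(bell(10))           # 115975 possible partitions in total
```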
ranging from the most to the least similar. The assumption here is that if we have
restricted resources, such as only being able to test n% of the dataset, we can prioritize
compounds using similarity searching such that the top n% of our ranked list will be
more likely to exhibit our desired properties.
The typical way to objectively evaluate the quality of a particular similarity searching
campaign is to test the recall of a similarity search using a particular search query
molecule that is known to be active against a dataset that has been seeded with active
compounds with the same activity, but not necessarily similar. This may then be plotted
as an enrichment curve, where the axes are the percentage or number of database
compounds screened, against the number of actives recalled at that screening level.
This provides a very intuitive method of evaluating the quality of one particular method
over another and one can readily determine the quality of enrichment relevant to one’s
particular screening level (Figure 17(a)).
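A sketch of the bookkeeping behind such a curve (the scores and labels are illustrative assumptions):

```python
def enrichment_curve(scores, is_active):
    """Rank by descending similarity; count actives recalled at each level."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    recalled, curve = 0, []
    for rank, i in enumerate(order, start=1):
        recalled += is_active[i]
        curve.append((rank / len(order), recalled))
    return curve            # (fraction screened, actives recalled) pairs

print(enrichment_curve([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))
# A perfect ranking: both actives are recalled within the top 50% screened.
```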
Recently, an approach known as data fusion (or consensus scoring in the docking com-
munity) has gained widespread acceptance in combining values from multiple sources
for a single object to further enrich our similarity searches [Hert et al. 2004]. Essen-
tially, there are two particular approaches to data fusion that are currently applied in
similarity searching: similarity fusion and group fusion. In similarity fusion, a single
query or reference molecule is used to search against a structure database using a va-
riety of methods such as alternate descriptors or similarity coefficients. Group fusion,
however, uses multiple reference molecules with a single descriptor and coefficient.
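A minimal sketch of group fusion using the MAX rule, one common fusion rule in the cited work (the score matrix is an illustrative assumption):

```python
def group_fusion(score_lists):
    """Fuse per-reference similarity scores with the MAX rule."""
    # score_lists[r][i]: similarity of database molecule i to reference r.
    return [max(col) for col in zip(*score_lists)]

fused = group_fusion([[0.2, 0.9, 0.1],
                      [0.8, 0.3, 0.2]])
print(fused)   # [0.8, 0.9, 0.2]; the database is then ranked by these values
```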
Fig. 17. An illustration of the related tasks of (a) similarity searching and (b) diversity (or subset) selection.
In the case of similarity searching (a), the aim is to locate the nearest neighbors of a molecule of interest.
However, it can be seen that potentially interesting “islands” of activity can be missed using this approach.
Diversity selection (b), on the other hand, seeks to select a subset of compounds from a larger set such that
the space is covered—here, in an approach referred to as sphere exclusion.
6.4.2. Cell-Based Compound Selection. Cell-based methods are one of the simplest ap-
proaches to compound selection. The algorithm proceeds by partitioning the, most likely,
high-dimensional descriptor space into equidistant or varidistant partitions with the
aim of arriving at a number of cells in the space that is closest to the desired number of
data points. Then a single point is selected from each of the cells according to some rule.
Again, this may simply be random, or by picking the most central point, or centroid.
6.4.3. Cluster Analysis. The clustering methods mentioned previously can also be ap-
plied for diverse subset selection [Schuffenhauer et al. 2006]. One proceeds by performing the cluster analysis such that the number of clusters equals the number of data points required in the diverse subset. A single object may then be selected from each of the clusters, preferably by selecting the object that is nearest the center of the cluster (or centroid).
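A sketch of this selection scheme (assuming scikit-learn's K-means and a random descriptor matrix as stand-ins):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_subset(X, n_picks):
    """Cluster into n_picks groups; keep the compound nearest each centroid."""
    km = KMeans(n_clusters=n_picks, n_init=10, random_state=0).fit(X)
    picks = []
    for c in range(n_picks):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks

X = np.random.rand(100, 16)      # 100 compounds, 16 descriptors (stand-ins)
print(diverse_subset(X, 5))      # indices of a 5-compound diverse subset
```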
6.4.4. Onion Design. The final method to be considered here is the use of a method
called D-optimal onion design (DOOD) by extension of D-optimal design as a space-
filling design [Eriksson et al. 2004], which in isolation gives a shell design. The ap-
plication of D-optimal design on a given space will provide a designed set of points
that covers the surface of the space covered by the given objects, which is limited in
applicability as mentioned earlier in this section. However, by iteratively performing a
D-optimal design, and removing the surface points from the space, this will develop a
designed subset that covers the volume of the space.
7. PREDICTIVE MODELING
As we have seen already, mathematics and chemistry have long been related disci-
plines. As early as the mid-19th century, Alexander Crum Brown (1838–1922) and
Thomas Richard Fraser (1841–1920) suggested that a mathematical relationship can be defined between the physiological activity (Φ) of a molecule and its chemical constitution (C), of the form Φ = f(C).
The challenge in defining the function mapping was seen as largely due to the accuracy in defining C and Φ, respectively [Livingstone 2000]. These definitions remain a chal-
lenge today, but the general issue of relating structure to property is of considerable
importance in modern drug discovery and is used widely to guide decision making, even
though our models are not as accurate as we would wish them to be.
Statistical models are of great importance in chemoinformatics since they allow the
correlation of a measured response (dependent variable) such as biological activity
with calculated molecular descriptors (independent variables). These models can then
be applied to the forward problem of predicting the responses for unseen data points
entirely in silico.
Two particular types of supervised learning methods are applied widely in chemoin-
formatics: classification and regression. Classification methods assign new objects, in
our case molecules, to two or more classes—most frequently either biologically active or
inactive. Regression methods, however, attempt to use continuous data, such as a mea-
sured biological response variable, to correlate molecules with that data so as to predict
a continuous numeric value for new and unseen molecules using the generated model.
The most often used methods for classification are partial least squares discriminant analysis (PLS-DA), the naïve Bayesian classifier (NBC), recursive partitioning (RP), and support vector machines (SVM), whereas, for regression modeling, methods such as multiple linear regression (MLR), partial least squares (PLS), and artificial neural networks (ANNs) are used.
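As a sketch of the supervised classification setting (an SVM from scikit-learn, with random bit vectors standing in for real fingerprints and activity labels):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 512))   # stand-in binary fingerprints
y = rng.integers(0, 2, size=200)          # stand-in active/inactive labels

# Train on 150 molecules; predict the activity class of the unseen 50.
model = SVC(kernel="rbf").fit(X[:150], y[:150])
print("held-out accuracy:", model.score(X[150:], y[150:]))
```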
The clustering methods discussed in section 6.2 and classification methods described
here fall into two particular types of statistical learning methods: unsupervised and
supervised, respectively. Unsupervised learning is used to determine natural groupings
of objects based solely on their independent variables, as in SAHN, K -means clustering,
SOM, PCA, and MDS, whereas, supervised statistical learning uses a priori information
regarding the classes to which the objects in a training set belong, as in PLS-DA, NBC,
RP, and SVM. This model is then used to classify new objects. In like vein, regression
modeling is also a type of supervised learning method.
Predictive models and the modelers who generate them generally fall into one of two
domains of application: the predictive and the interpretive modes. There is generally a
tradeoff between prediction quality and interpretation quality, with the modeler deter-
mining which is preferred for the given application (Figure 18). Interpretable models
are generally desired in situations where the model is expected to provide information
about the problem domain and how best to navigate through chemistry space allowing
the medicinal chemist to make informed decisions. However, these models tend to suffer
in terms of prediction quality as they become more interpretable.
The reverse is true with predictive models in that their interpretation suffers as
they become more predictive. Models that are highly predictive tend to use molecu-
lar descriptors that are not readily interpretable by the chemist such as information-
based descriptors. However, predictive models are generally not intended to provide
transparency, but predictions that are more reliable and can therefore be used as high-
throughput models for filtering tasks or similar approaches.
Fig. 18. The tradeoff surface between the predictive and interpretive modes of predictive models and ex-
amples of the types of descriptors or methods that are used. This figure has been adapted from an original
diagram by Lewis, published in Brown and Lewis [2006].
numerous model statistics are available that can indicate if new data points, from which
responses are to be predicted, can be applied to the model [Brown and Lewis 2006].
spaces are therefore explored as the generations of the GA are iterated while the mi-
gration of solutions between the individual populations has the effect of improving the
efficiency of the algorithm.
8. OVERVIEW
Chemoinformatics is now an essential component of chemical discovery, and nowhere is
this more apparent than in the development of new pharmaceutical treatments for
unmet medical needs. The field has a long and varied heritage, exhibiting influences
from chemistry, of course, but also from mathematics, statistics, biology, computer sci-
ence, and more besides. Now, the field of chemoinformatics is truly an interface science, requiring skilled scientists from all of these fields to direct our research endeavors in the right direction; it is an enriching and rewarding area of research that will only increase in its importance to drug discovery and other fields of chemistry in the coming years.
In conclusion, it is fitting to reflect on how rapidly chemoinformatics has become a
mainstay of chemical research with an “Irishism” from an early pioneer of the field,
Michael Lynch: “Here we sit side by side with those on whose shoulders we stand.”
ACKNOWLEDGMENTS
The author would like to thank his academic mentors Peter Willett (University of Sheffield, U.K.) and Johann
Gasteiger (University of Erlangen-Nürnberg, Germany), together with his mentors from industry Richard
Lewis (Eli Lilly & Co.), Ben McKay (Avantium Technologies), and Edgar Jacoby (Novartis Institutes for
BioMedical Research), for their support and encouragement in pursuing novel research in chemoinformatics.
In addition, the author would like to thank the following colleagues from the Novartis Institutes for BioMed-
ical Research, Basel Switzerland: Kamal Azzaoui, Peter Ertl, Stephen Jelfs, Jörg Mühlbacher, Maxim Popov,
Ansgar Schuffenhauer, and Paul Selzer. The author would also like to thank all of the previous and present
members of the chemoinformatics research groups in Sheffield and Erlangen—along with all of the many re-
searchers with whom he has collaborated, including those from Eli Lilly and Co., Avantium Technologies, the
Novartis Institutes for BioMedical Research, and the Institute of Cancer Research—for their encouragement
and fostering of an interdisciplinary approach to chemoinformatics and modern drug discovery. The author
dedicates this article to the memory of his mum who encouraged him from an early age to be inquisitive
about the world and ask questions, which led him to a career in science.
REFERENCES
ADAM, D. 2002. Chemists synthesize a single naming system. Nature 417, 369.
BAJORATH, J., ED. 2004. Chemoinformatics: Concepts, Methods and Tools for Drug Discovery. Humana
Press, Totowa, NJ.
BALABAN, A. T. 1985. Applications of graph theory in chemistry. J. Chem. Inf. Comput. Sci. 25, 334–343.
BARNARD, J. M. AND DOWNS, G. M. 1992. Clustering of chemical structures on the basis of two-dimensional
similarity measures. J. Chem. Inf. Comput. Sci. 32, 644–649.
BAUERSCHMIDT, S. AND GASTEIGER, J. 1997. Overcoming the limitations of a connection table description: A
universal representation of chemical species. J. Chem. Inf. Comput. Sci. 37, 705–714.
BEMIS, G. W. AND MURCKO, M. A. 1996. The properties of known drugs. 1. Molecular frameworks. J. Med.
Chem. 39, 2887–2893.
BENDER, A. AND GLEN, R. C. 2004. Molecular similarity: A key technique in molecular informatics. Org.
Biomol. Chem. 2, 3204–3218.
BÖHM, H.-J., FLOHR, A., AND STAHL, M. 2004. Scaffold hopping. Drug Discov. Today: Tech. 1, 217–224.
BROOIJMANS, N. AND KUNTZ, I. D. 2003. Molecular recognition and docking algorithms. Ann. Rev. Biophys.
Biomol. Struct. 32, 335–373.
BROWN, F. K. 1998. Chemoinformatics: What is it and how does it impact drug discovery? Ann. Rep. Med.
Chem. 33, 375–384.
BROWN, N. AND JACOBY, E. 2006. On scaffolds and hopping in medicinal chemistry. Mini Rev. Med. Chem. 6,
1217–1229.
BROWN, N. AND LEWIS, R. A. 2006. Exploiting QSAR methods in lead optimization. Curr. Opin. Drug Discov.
Devel. 9, 419–424.
BROWN, N., MCKAY, B., AND GASTEIGER, J. 2005. Fingal: A novel approach to geometric fingerprinting and a
comparative study of its application to 3D QSAR modelling. QSAR Comb. Sci. 24, 480–484.
BROWN, N., MCKAY, B., GILARDONI, F., AND GASTEIGER, J. 2004. A graph-based genetic algorithm and its
application to the multiobjective evolution of median molecules. J. Chem. Inf. Comput. Sci. 44, 1079–
1087.
BROWN, R. D. AND MARTIN, Y. C. 1997. The information content of 2D and 3D structural descriptors relevant
to ligand-receptor binding. J. Chem. Inf. Comput. Sci. 37, 1–9.
CECHETTO, J. D., ELOWE, N. H., BLANCHARD, J. E., AND BROWN, E. D. 2004. High-throughput screening at
McMaster University: Automation in academe. J. Assoc. Lab. Auto. 9, 307–311.
COHEN, J. 2004. Bioinformatics—an introduction for computer scientists. ACM Comput. Surv. 36, 122–158.
COLES, S. J., DAY, N. E., MURRAY-RUST, P., RZEPA, H. S., AND ZHANG, Y. 2005. Enhancement of the chemical
semantic web through the use of InChI identifiers. Org. Biomol. Chem. 3, 1832–1834.
COREY, E. J. AND CHENG, X.-M. 1995. The Logic of Chemical Synthesis. Wiley, New York, NY.
CRAMER, R. D., III, PATTERSON, D. E., AND BUNCE, J. D. 1988. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Amer. Chem. Soc. 110, 5959–5967.
CRUM BROWN, A. 1864. On the theory of isomeric compounds. Trans. Roy. Soc. Edinb. 23, 707–719.
CRUM BROWN, A. AND FRASER, T. R. 1869. V.—On the connection between chemical constitution and physiological action. Part. I.—On the physiological action of the salts of the ammonium bases, derived from strychnia, brucia, thebaia, codeia, morphia, and nicotia. Trans. Roy. Soc. Edinb. 25, 151–203.
DIESTEL, R. 2000. Graph Theory, 2nd Ed. Springer-Verlag, New York, NY.
DIMASI, J. A., HANSEN, R. W., AND GRABOWSKI, H. G. 2003. The price of innovation: New estimates of drug
development costs. J. Health Econ. 22, 151–185.
DURANT, J. L., LELAND, B. A., HENRY, D. R., AND NOURSE, J. G. 2002. Reoptimization of MDL keys for use in
drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280.
ERIKSSON, L., ARNHOLD, T., BECK, B., FOX, T., JOHANSSON, E., AND KRIEGL, J. M. 2004. Onion design and its
application to a pharmaceutical QSAR problem. J. Chemomet. 18, 188–202.
ERIKSSON, L., JAWORSKA, J., WORTH, A. P., CRONIN, M. T. D., AND MCDOWELL, R. M. 2003. Methods for reliability
and uncertainty assessment and for applicability evaluations of classification- and regression-based
QSARs. Environ. Health Perspect. 111, 1361–1375.
ERTL, P. 2007. In silico identification of bioisosteric functional groups. Curr. Opin. Drug Discov. Devel. 10,
281–288.
FERRARA, P., PRIESTLE, J. P., VANGREVELINGHE, E., AND JACOBY, E. 2006. New developments and applications
of docking and high-throughput docking for drug design and in silico screening. Curr. Comp.-Aided Drug
Des. 2, 83–91.
FUJITA, T., IWASA, J., AND HANSCH, C. 1964. A new substituent constant, π, derived from partition coefficients.
J. Amer. Chem. Soc. 86, 5175–5180.
GASTEIGER, J., ED. 2003. The Handbook of Chemoinformatics. Wiley-VCH, Weinheim, Germany.
GASTEIGER, J. AND ENGEL, T., EDS. 2003. Chemoinformatics: A Textbook. Wiley-VCH, Weinheim, Germany.
GASTEIGER, J., PFÖRTNER, M., SITZMANN, M., HÖLLERING, R., SACHER, O., KOSTKA, T., AND KARG, N. 2000.
Computer-assisted synthesis and reaction planning in combinatorial chemistry. Persp. Drug Discov. Des.
20, 1–21.
GASTEIGER, J., RUDOLPH, C., AND SADOWSKI, J. 1990. Automatic generation of 3D atomic coordinates for
organic molecules. Tetrahed. Comput. Methodol. 3, 537–547.
GHOSE, A. K. AND CRIPPEN, G. M. 1986. Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure-activity relationships. I. Partition coefficients as a measure of hydrophobicity. J. Comp. Chem. 7, 565–577.
GILLET, V. J., WILLETT, P., BRADSHAW, J., AND GREEN, D. V. S. 1999. Selecting combinatorial libraries to optimize
diversity and physical properties. J. Chem. Inf. Comput. Sci. 39, 169–177.
GOLDBERG, K., NEWMAN, M., AND HAYNSWORTH, E. 1972. Combinatorial analysis. In Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables, 10th ed., M. Abramowitz and I. A. Stegun, Eds. U.S. Government Printing Office, Washington, DC, 824–825.
GORSE, A.-D. 2006. Diversity in medicinal chemistry space. Curr. Top. Med. Chem. 6, 3–18.
GUND, P. 1979. Pharmacophoric pattern searching and receptor mapping. Ann. Rep. Med. Chem. 14, 299–
308.
GÜNER, O. F. 2005. The impact of pharmacophore modeling in drug design. IDrugs 8, 567–572.
HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, NY.
HERT, J., WILLETT, P., WILTON, D. J., ACKLIN, P., AZZAOUI, K., JACOBY, E., AND SCHUFFENHAUER, A. 2004. Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J. Chem. Inf. Comput. Sci. 44, 1177–1185.
JOHNSON, M. A. AND MAGGIORA, G. M., EDS. 1990. Concepts and Applications of Molecular Similarity. Wiley-Interscience, New York, NY.
JONES, G., WILLETT, P., GLEN, R. C., LEACH, A. R., AND TAYLOR, R. 1997. Development and validation of a
genetic algorithm for flexible docking. J. Mol. Biol. 267, 727–748.
KARELSON, M. 2000. Molecular Descriptors in QSAR/QSPR. Wiley-VCH, Weinheim, Germany.
KITCHEN, D. B., DECORNEZ, H., FURR, J. R., AND BAJORATH, J. 2004. Docking and scoring in virtual screening
for drug discovery: Methods and applications. Nature Rev. Drug Discov. 3, 935–949.
KUNTZ, I. D., BLANEY, J. M., OATLEY, S. J., LANGRIDGE, R., AND FERRIN, T. E. 1982. A geometric approach to
macromolecule-ligand interactions. J. Mol. Biol. 161, 269–288.
LEACH, A. R. 2001. Molecular Modelling: Principles and Applications, 2nd ed. Prentice Hall, Harlow, U.K.
LEACH, A. R. AND GILLET, V. J. 2003. An Introduction to Chemoinformatics. Kluwer Academic Publishers,
Dordrecht, The Netherlands.
LEWELL, X. Q., JUDD, D. B., WATSON, S. P., AND HANN, M. M. 1998. RECAP—retrosynthetic combinatorial analysis procedure: A powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J. Chem. Inf. Comput. Sci. 38, 511–522.
LIPINSKI, C. A., LOMBARDO, F., DOMINY, B. W., AND FEENEY, P. J. 2001. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 46, 3–26.
LIVINGSTONE, D. J. 2000. The characterization of chemical structures using molecular properties. A survey.
J. Chem. Inf. Comput. Sci. 40, 195–209.
LYNCH, M. F. 2004. Introduction of computers in chemical structure information systems, or what is not
recorded in the annals. In The History and Heritage of Scientific and Technological Information Systems:
Proceedings of the 2002 Conference, W. B. Rayward and M. E. Bowden, Eds. Information Today, Inc.,
Medford, NJ, 137–148.
MARKUSH, E. A. 1924. Pyrazolone dye and process of making the same. U.S. Patent No. 1,506,316, August
26.
MIGLIAVACCA, E. 2003. Applied introduction to multivariate methods used in drug discovery. Mini Rev. Med.
Chem. 3, 831–843.
MORGAN, H. L. 1965. The generation of a unique machine description for chemical structures—a technique developed at Chemical Abstracts Service. J. Chem. Doc. 5, 107–113.
NICOLAOU, C. A., BROWN, N., AND PATTICHIS, C. S. 2007. Molecular optimization using multi-objective methods. Curr. Opin. Drug Discov. Devel. 10, 316–324.
OPREA, T., ED. 2005a. Chemoinformatics in Drug Discovery. Wiley-VCH, Weinheim, Germany.
OPREA, T. 2005b. Is safe exchange of data possible? Chem. Eng. News 83, 24–29.
PEARLMAN, R. S. 1987. Rapid generation of high quality approximate 3D molecular structures. Chem. Des. Autom. News 2, 5–7.
RAEVSKY, O. A. 2004. Physicochemical descriptors in property-based drug design. Mini Rev. Med. Chem. 4,
1041–1052.
REICH, H. J. AND CRAM, D. J. 1969. Macro rings. XXXVII. Multiple electrophilic substitution reactions of
[2,2]paracyclophanes and interconversions of polysubstituted derivatives. J. Am. Chem. Soc. 91, 3527–
3533.
ROGERS, D., BROWN, R. D., AND HAHN, M. 2005. Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. J. Biomol. Screen. 10, 682–686.
RUSSO, E. 2002. Chemistry plans a structural overhaul. Nature Jobs 419, 4–7.
SCHNEIDER, G. AND FECHNER, U. 2005. Computer-based de novo design of drug-like molecules. Nature Rev.
Drug Discov. 4, 649–663.
SCHUFFENHAUER, A. AND BROWN, N. 2006. Chemical diversity and biological activity. Drug Discov. Today:
Technol. 3, 387–395.
SCHUFFENHAUER, A., BROWN, N., ERTL, P., JENKINS, J. L., SELZER, P., AND HAMON, J. 2007. Clustering and rule-based classifications of chemical structures evaluated in the biological activity space. J. Chem. Inf. Mod. 47, 325–336.
SCHUFFENHAUER, A., BROWN, N., SELZER, P., ERTL, P., AND JACOBY, E. 2006. Relationships between molecular complexity, biological activity, and structural diversity. J. Chem. Inf. Mod. 46, 525–535.
SCHUFFENHAUER, A., FLOERSHEIM, P., ACKLIN, P., AND JACOBY, E. 2003. Similarity metrics for ligands reflecting the similarity of the target proteins. J. Chem. Inf. Comput. Sci. 43, 391–405.
SNAREY, M., TERRETT, N. K., WILLETT, P., AND WILTON, D. J. 1997. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graph. Mod. 15, 372–385.
TODESCHINI, R. AND CONSONNI, V. 2000. Handbook of Molecular Descriptors. Wiley-VCH, Weinheim,
Germany.
WEININGER, D. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36.
WILLETT, P. 1991. Three-Dimensional Chemical Structure Handling. Research Studies Press, Baldock, Hertfordshire, U.K.
WILLETT, P. 2000. Textual and chemical information processing: Different domains but similar algorithms.
Inform. Res. 5, http://informationr.net/ir/5-2/paper69.html.
WILLETT, P., BARNARD, J. M., AND DOWNS, G. M. 1998. Chemical similarity searching. J. Chem. Inf. Comput.
Sci. 38, 983–996.