A new 3D molecular structure representation using quantum topology with application to structure–property relationships

https://doi.org/10.1016/S0169-7439(00)00101-5

Abstract

We present a new 3D molecular structure representation based on Richard F.W. Bader's quantum topological atoms in molecules (AIM) theory for use in quantitative structure–property/activity relationship (QSPR/QSAR) modeling. Central to this structure representation using quantum topology (StruQT) are critical points located on the electron density distribution of the molecules. Other gradient fields such as the Laplacian of the electron density distribution can also be used. The type of critical point of particular interest is the bond critical point (BCP), which is here characterized by three parameters: the electron density ρ, the Laplacian ∇²ρ and the ellipticity ϵ. This representation has the advantage that there is no need to probe a large number of lattice points in 3D space to capture the important parts of the 3D electronic structure, as is necessary in, e.g. comparative molecular field analysis (CoMFA).

We tested the new structure representation by predicting the wavelength of the lowest UV transition for a system of 18 anthocyanidins. Different quantitative structure–property relationship (QSPR) models were constructed using several chemometric/machine learning methods such as standard partial least squares regression (PLS), truncated PLS variable selection, genetic algorithm-based variable selection and genetic programming (GP). These models identified bonds that take part in either decreasing or increasing the dominant excitation wavelength. The models also correctly emphasized the involvement of the conjugated π system in predicting the wavelength by flagging the BCP ellipticity parameters as important for this particular data set.

Introduction

The underlying assumption in rational drug design is that it is possible to predict the property or biological activity of compounds by knowing their molecular structure. This has led the field of quantitative structure–activity/property relationships (QSAR/QSPR) to develop ways to encode chemical structure for use with advanced chemometric and artificial intelligence methods. A successful QSAR/QSPR model is not only constructed to correctly estimate the numerical value of the property or biological activity, but also to give a deeper understanding of what structural features are important for the observed property/activity. The goal for the chemometric/machine learning methods is to create a chemical model that is easy to understand.

To accomplish this, it is crucial that the 3D molecular structure is represented in the best possible way. This does not only mean the 3D geometrical structure, but also a proper description of the 3D electronic structure based on quantum mechanics. This task has proven to be difficult. The two main reasons are: (a) attribute-based methods such as partial least squares regression (PLS) have problems with representing 3D geometrical structures in general [1], [2] and (b) quantum theory-based 3D electronic descriptors can be difficult to use and interpret [3], [4], [5], [6], [7], [8]. In this article, we shall mainly focus on the latter of these two problems, but we recognize that the two problems are not independent.

According to quantum theory, all measurable properties of a molecule are contained in the wave function, which is a solution to the Schrödinger equation [9]. Could the electronic structure of molecules be described by using the electronic wave function itself directly? In principle, this should be possible; however, there are two main problems with this approach. Firstly, wave functions are cumbersome to interpret. Secondly, they are high-dimensional continuous functions that are difficult to represent.

Instead of using the wave function directly, many QSAR/QSPR methods have been developed to extract calculated properties from it, such as the electrostatic potential, charge, dipole moments etc. However, selecting these property values can sometimes be a problem because it is difficult to know when enough descriptors for a specific QSAR/QSPR problem have been selected. It would, therefore, be better to have a representation that is more general in order to obtain a unified approach.

The comparative molecular field analysis (CoMFA) technique [10], [11], [12], [13] is probably the most popular 3D-QSAR/QSPR method in use today. It tries to capture the 3D distribution of the electrostatic potential using a positively charged probe in multiple positions arranged in a lattice around the molecules. The measured property at each lattice point is allocated a separate variable in a data matrix that is subsequently analysed with PLS. However, there is a serious problem with this representation. Since the sampling occurs in 3D space, it is vital that the molecules are oriented in a comparable way. Even small perturbations in the relative orientation of the molecules can cause serious problems since one attribute (i.e. a space point) may be forced to be comparable to the wrong space location. This can be seen as a 3D version of the problem of using multivariate analysis on spectra with significant peak shifts [14], [15], [16]. Thus, the alignment procedure itself has a pivotal role in the success of these methods. This is particularly true for QSAR/QSPR problems where there is no obvious common structural framework that can be used as an aid in the alignment.
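
The sensitivity to alignment can be illustrated with a minimal sketch. It assumes a toy "molecule" made of two point charges, a unit positive probe, a plain Coulomb potential in arbitrary units and a small cubic lattice. None of these choices is taken from an actual CoMFA implementation, and all function names are hypothetical.

```python
import math

def lattice(n, spacing):
    """Return n*n*n probe positions on a cubic lattice centred on the origin."""
    offset = (n - 1) * spacing / 2.0
    axis = [i * spacing - offset for i in range(n)]
    return [(x, y, z) for x in axis for y in axis for z in axis]

def potential(point, charges):
    """Coulomb potential (arbitrary units) felt by a unit positive probe."""
    total = 0.0
    for q, pos in charges:
        r = math.dist(point, pos)
        total += q / max(r, 0.5)  # clamp r to avoid singularities near a charge
    return total

def field_descriptor(charges, points):
    """One variable per lattice point, as in a lattice-based field method."""
    return [potential(p, charges) for p in points]

def translate(charges, shift):
    """Rigidly shift every charge by the same displacement vector."""
    return [(q, tuple(c + s for c, s in zip(pos, shift))) for q, pos in charges]

# Hypothetical toy molecule: two opposite point charges 1.4 units apart.
molecule = [(+1.0, (-0.7, 0.0, 0.0)), (-1.0, (0.7, 0.0, 0.0))]
points = lattice(n=5, spacing=1.0)

aligned = field_descriptor(molecule, points)
shifted = field_descriptor(translate(molecule, (0.3, 0.0, 0.0)), points)

# Largest change in any single lattice variable caused by a 0.3-unit misalignment.
max_change = max(abs(a - b) for a, b in zip(aligned, shifted))
print(len(aligned), round(max_change, 3))
```

The same molecule yields a different descriptor vector after a small rigid shift, which is the effect that makes the alignment step so critical for lattice-based methods.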

The other serious problem is related to the sampling density. At what density level should the lattice points be sampled to obtain a good representation of the potential? Since the whole point of a QSAR/QSPR study is to localize the important structural features relevant for the property/activity in question, very little information is available to properly solve this problem. The best strategy is to use as many sampling points as possible, and maybe reduce them later using a variable selection procedure (such as the GOLPE [17], [18], [19] and the G-WHIM [20] methods). The other side of the sampling problem is that the number of variables needed in CoMFA models can be very large. Each variable carries little chemical information, and interpretation of the QSAR/QSPR model can be difficult.
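
How quickly the number of variables grows with sampling density is easy to quantify. The sketch below counts lattice points for a cubic box. The 20 Å edge length and the three spacings are illustrative assumptions, not values from this work.

```python
def lattice_variables(box_edge, spacing):
    """Number of lattice points (one variable each) in a cubic box."""
    per_axis = int(round(box_edge / spacing)) + 1
    return per_axis ** 3

# Assumed 20 Å box: each halving of the spacing increases the
# variable count by a factor approaching eight.
for spacing in (2.0, 1.0, 0.5):
    print(spacing, lattice_variables(20.0, spacing))
```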

What we want from a structure representation is a set of variables that contain a maximum of chemical information without making any compromises with respect to quantum mechanics. A promising candidate for such a representation appears at first glance to be the 3D electron density distribution ρ(r) [21], [22]. It is firmly rooted in quantum mechanics, and most chemists feel comfortable with it. However, representing this 3D function as sampled lattice points would cause the same problems as faced with CoMFA and related methods [23], [24].

A solution to this problem would be to find a highly compressed representation of the electronic density that can also be related to intuitive chemical concepts. Fortunately, there is a theory that provides such a representation: The quantum topological atoms in molecules (AIM) theory [25], [26], [27], [28] pioneered by Professor Richard F.W. Bader at McMaster University in Canada. As will be shown in this article, AIM theory enables us to construct an electronic representation which (a) uses a small number of variables to encode the electronic structure, (b) contains significant chemical information for each variable, (c) is directly connected to quantum mechanics and (d) is easy to interpret.
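
For orientation, the three bond critical point parameters named in the abstract (ρ, its Laplacian and the ellipticity ϵ) have standard definitions in AIM theory. The notation below is assumed here and does not appear in the text above: r_c denotes the position of the critical point, and λ1 ≤ λ2 ≤ λ3 are the eigenvalues of the Hessian of ρ evaluated at r_c.

```latex
\nabla \rho(\mathbf{r}_c) = \mathbf{0}, \qquad
\lambda_1 \le \lambda_2 < 0 < \lambda_3 \quad \text{(bond critical point)}

\nabla^2 \rho(\mathbf{r}_c) = \lambda_1 + \lambda_2 + \lambda_3, \qquad
\epsilon = \frac{\lambda_1}{\lambda_2} - 1
```

An ellipticity of zero corresponds to a cylindrically symmetric bond, while larger values indicate that the density is preferentially accumulated in one plane, as in bonds with π character.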

Section snippets

Compression of the 3D electron density distribution

AIM theory makes a link between quantum mechanics and standard chemical concepts such as an atom and a chemical bond. There is no explicit concept of an atom or a bond in the Schrödinger equation. It is only concerned with particles (electrons and nuclei) in potential fields. However, much of standard chemical knowledge is based on the atomic model. So, on one hand, we want to make use of rigorous and physically correct quantum mechanics in our calculations, and on the other hand, we

The AIM descriptor matrix

Here, a scheme for comparing molecules that is specially designed to work for attribute-based data analytical methods such as PLS is outlined. Of course, our StruQT approach can also be used with non-attribute-based machine learning methods such as inductive logic programming (ILP) [1], [2].

In our structure representation, each molecule is contained in a matrix (here referred to as the AIM descriptor matrix, ADM), where each row contains the information about a single critical point. Our
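
As a rough illustration of this layout, the sketch below stores one row per bond critical point, holding the three parameters named in the abstract (ρ, ∇²ρ, ϵ), and then unfolds the matrix into a single descriptor vector for an attribute-based method such as PLS. The unfolding step, the assumption that all molecules share the same bond ordering, the bond labels and all numerical values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# One row per bond critical point: (rho, laplacian, ellipticity).
BCPRow = Tuple[float, float, float]

def build_adm(bcps: Dict[str, BCPRow], bond_order: List[str]) -> List[BCPRow]:
    """Arrange BCP rows in a fixed bond order so molecules are comparable."""
    return [bcps[bond] for bond in bond_order]

def unfold(adm: List[BCPRow]) -> List[float]:
    """Flatten the matrix row by row into one descriptor vector."""
    return [value for row in adm for value in row]

# Hypothetical bond labels and made-up values, for illustration only.
bond_order = ["C2-C3", "C3-C4", "C4-C10"]
molecule_a = {
    "C2-C3": (0.31, -0.85, 0.25),
    "C3-C4": (0.30, -0.82, 0.21),
    "C4-C10": (0.29, -0.78, 0.18),
}

adm = build_adm(molecule_a, bond_order)
x_row = unfold(adm)
print(len(adm), len(x_row))
```

Because every molecule is unfolded in the same bond order, a given column of the resulting data matrix always refers to the same parameter of the same bond, which is what an attribute-based method requires.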

Selection of data set

In order to test out the StruQT approach, we have decided to use a data set where comparison between the different molecules is straightforward. It was also desirable to have a data set where there is some quantum theoretical understanding of the connection between the structure and the property/activity studied. Based on this, we chose a data set where the aim is to predict the wavelength λmax of the UV absorption maximum of the first π–π* excitation.

The set of compounds studied are

PLS analysis using all variables

Before performing any regression analysis, three compounds were set aside as an independent validation set. These compounds are:

  • 6′-Hydroxyflavylium, no. 1.

  • 4′-Dihydroxyflavylium, no. 3.

  • 5,7,3′,4′,5′-Tetrahydroxyflavylium, no. 17.

These compounds were selected as they spanned the variation in the data. The numbering of the compounds corresponds to the one used in Table 2.
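
The hold-out step can be sketched as follows. Only the compound numbers 1, 3 and 17 and the total of 18 compounds come from the text. The function name, the use of plain integer labels and the assumption that the compounds are numbered consecutively from 1 to 18 are hypothetical.

```python
def split_by_number(compound_numbers, validation_numbers):
    """Split compound labels into calibration and validation lists."""
    validation = [n for n in compound_numbers if n in validation_numbers]
    calibration = [n for n in compound_numbers if n not in validation_numbers]
    return calibration, validation

# Assumed numbering 1..18 for the 18 compounds.
compounds = list(range(1, 19))
calibration, validation = split_by_number(compounds, {1, 3, 17})
print(len(calibration), validation)
```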

For this PLS analysis, the routine crossval in the PLS_toolbox [56] was used. The maximum number of PLS factors tested for

Conclusion

We have demonstrated that the StruQT approach can be used to successfully form QSPR models that are easy to interpret and have high predictive ability.

Inspection of our anthocyanidin QSPR models made it possible to identify the importance of the π system through the ellipticity parameter and, more importantly, to identify which bonds are involved in either decreasing or increasing the value of λmax.

Acknowledgements

NG wishes to thank the BBSRC for financial support (grant no. 2/B11471). We express our gratitude to Dr. Knut J. Børve at the University of Bergen, Norway for kindly providing the optimized anthocyanidin structures and commenting on the manuscript. Dr. Richard Gilbert is thanked for providing the GP program used.

References (62)

  • P.L.A Popelier

    A method to integrate an atom in a molecule without explicit representation of the interatomic surface

    Comput. Phys. Commun.

    (1998)
  • B.K Alsberg

    Wavelets in parsimonious functional data analysis models

  • R.D King et al.

    Prediction of rodent carcinogenicity bioassays from molecular structure using inductive logic programming

    Environ. Health Perspect.

    (1996)
  • R.D King et al.

    Structure–activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming

    Proc. Natl. Acad. Sci. U. S. A.

    (1996)
  • R Carbo et al.

    An electron density measure of the similarity between two compounds

    Int. J. Quantum Chem.

    (1980)
  • R Carbo et al.

    LCAO–MO similarity measures and taxonomy

    Int. J. Quantum Chem.

    (1987)
  • J Cioslowski et al.

    Assessing molecular similarity from results of ab initio electronic structure calculations

    J. Am. Chem. Soc.

    (1991)
  • R Ponec et al.

    Molecular basis of quantitative structure–properties relationships (QSPR): a quantum similarity approach

    J. Comput.-Aided Mol. Des.

    (1999)
  • A.C Good

    From fields to pharmacophores: a historical perspective of explicit 3D molecular similarity calculations

    Internet J. Chem.

    (2000)
  • B Beck et al.

    Some biological applications of semiempirical MO theory

  • R Shankar

    Principles of Quantum Mechanics

    (1994)
  • R.D Cramer et al.

    Comparative molecular field analysis (CoMFA): 1. Effect of shape on binding of steroids to carrier proteins

    J. Am. Chem. Soc.

    (1988)
  • C Marot et al.

    Comparative molecular field analysis of selective cyclooxygenase-2 (COX-2) inhibitors

    Quant. Struct.-Act. Relat.

    (2000)
  • K.M Wang et al.

    Alignment of curves by dynamic time warping

    Ann. Stat.

    (1997)
  • M Baroni et al.

    Generating optimal linear PLS estimations (GOLPE): an advanced chemometric tool for handling 3D-QSAR problems

    Quant. Struct.-Act. Relat.

    (1993)
  • G Cruciani et al.

    GOLPE-guided region selection

    Perspect. Drug Discovery Des.

    (1998)
  • G Cruciani et al.

    Comparative molecular-field analysis using GRID force-field and GOLPE variable selection methods in a study of inhibitors of glycogen-phosphorylase-b

    J. Med. Chem.

    (1994)
  • J.W.M Nissink et al.

    Superposition of molecules: electron density fitting by application of Fourier transforms

    J. Comput. Chem.

    (1997)
  • C.M Breneman et al.

    The use of electron density-derived TAE molecular descriptors in QSAR and QSPR

    Abstr. Pap.-Am. Chem. Soc.

    (1998)
  • R.J Vaz

    Use of electron densities in comparative molecular field analysis (CoMFA): a quantitative structure activity relationship (QSAR) for electronic effects of groups

    Quant. Struct.-Act. Relat.

    (1997)
  • R.J Vaz et al.

    Use of electron densities in comparative molecular field analysis (CoMFA): OH bond dissociation energies in phenols

    Int. J. Quant. Chem.

    (1999)