A new 3D molecular structure representation using quantum topology with application to structure–property relationships
Introduction
The underlying assumption in rational drug design is that it is possible to predict the property or biological activity of compounds by knowing their molecular structure. This has led the field of quantitative structure–activity/property relationships (QSAR/QSPR) to develop ways to encode chemical structure for use with advanced chemometric and artificial intelligence methods. A successful QSAR/QSPR model is constructed not only to correctly estimate the numerical value of the property or biological activity, but also to give a deeper understanding of which structural features are important for the observed property/activity. The goal for the chemometric/machine learning methods is to create a chemical model that is easy to understand.
To accomplish this, it is crucial that the 3D molecular structure is represented in the best possible way. This does not only mean the 3D geometrical structure, but also a proper description of the 3D electronic structure based on quantum mechanics. This task has proven to be difficult. The two main reasons are: (a) attribute-based methods such as partial least squares regression (PLS) have problems with representing 3D geometrical structures in general [1], [2] and (b) quantum theory-based 3D electronic descriptors can be difficult to use and interpret [3], [4], [5], [6], [7], [8]. In this article, we shall mainly focus on the latter of these two problems, but we recognize that the two problems are not independent.
According to quantum theory, all measurable properties of a molecule are contained in the wave function, which is a solution to the Schrödinger equation [9]. Could the electronic structure of molecules be described by using the electronic wave function itself directly? In principle, this should be possible; however, there are two main problems with this approach. Firstly, wave functions are cumbersome to interpret. Secondly, they are high-dimensional continuous functions that are difficult to represent.
Instead of using the wave function directly, many QSAR/QSPR methods have been developed to extract calculated properties from it, such as the electrostatic potential, charge, dipole moments etc. However, selecting these property values can sometimes be a problem because it is difficult to know when enough descriptors for a specific QSAR/QSPR problem have been selected. It would, therefore, be better to have a representation that is more general in order to obtain a unified approach.
The comparative molecular field analysis (CoMFA) technique [10], [11], [12], [13] is probably the most popular 3D-QSAR/QSPR method in use today. It tries to capture the 3D distribution of the electrostatic potential using a positively charged probe placed at multiple positions arranged in a lattice around the molecules. The measured property at each lattice point is allocated a separate variable in a data matrix that is subsequently analysed with PLS. However, there is a serious problem with this representation. Since the sampling occurs in 3D space, it is vital that the molecules are oriented in a comparable way. Even small perturbations in the relative orientation of the molecules can cause serious problems, since one attribute (i.e. a space point) may be forced to be comparable to the wrong space location. This can be seen as a 3D version of the problem of using multivariate analysis on spectra with significant peak shifts [14], [15], [16]. Thus, the alignment procedure itself has a pivotal role in the success of these methods. This is particularly true for QSAR/QSPR problems where there is no obvious common structural framework that can be used as an aid in the alignment.
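The alignment sensitivity described above can be illustrated with a minimal sketch. It is not the CoMFA implementation: the two-charge "molecule", the lattice extent, the 0.5 distance clamp and all function names are assumptions made for illustration only. A +1 probe samples a simple Coulomb potential on a fixed lattice, and the same molecule is sampled again after a small translation.

```python
import math

def lattice(lo, hi, spacing):
    """Return lattice points covering the cube [lo, hi]^3 at the given spacing."""
    n = int(round((hi - lo) / spacing)) + 1
    axis = [lo + i * spacing for i in range(n)]
    return [(x, y, z) for x in axis for y in axis for z in axis]

def probe_potential(atoms, point, min_dist=0.5):
    """Coulomb potential (arbitrary units) felt by a +1 probe at `point`.

    `atoms` is a list of (x, y, z, partial_charge). Distances are clamped at
    `min_dist` to avoid the singularity at a nucleus (an assumption of this sketch).
    """
    total = 0.0
    for ax, ay, az, q in atoms:
        r = math.dist((ax, ay, az), point)
        total += q / max(r, min_dist)
    return total

def field_descriptor(atoms, points):
    """One variable per lattice point, as in a lattice-based data matrix row."""
    return [probe_potential(atoms, p) for p in points]

def translate(atoms, dx):
    """Shift every atom by dx along the x axis."""
    return [(x + dx, y, z, q) for x, y, z, q in atoms]

# Hypothetical two-charge "molecule" (coordinates and charges are made up).
molecule = [(-0.6, 0.0, 0.0, +0.4), (0.6, 0.0, 0.0, -0.4)]
points = lattice(-3.0, 3.0, 1.0)

reference = field_descriptor(molecule, points)
shifted = field_descriptor(translate(molecule, 0.5), points)

# Largest change in any single lattice variable caused by the small shift.
max_change = max(abs(a - b) for a, b in zip(reference, shifted))
print(len(points), max_change)
```

Because each lattice point is a separate variable, the shifted molecule yields different values in the same columns even though its electronic structure is unchanged, which is the misalignment problem in miniature.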
The other serious problem is related to the sampling density. At what density level should the lattice points be sampled to obtain a good representation of the potential? Since the whole point of a QSAR/QSPR study is to localize the important structural features relevant for the property/activity in question, very little information is available to properly solve this problem. The best strategy is to use as many sampling points as possible, and maybe reduce them later using a variable selection procedure (such as the GOLPE [17], [18], [19] and the G-WHIM [20] methods). The other side of the sampling problem is that the number of variables needed in CoMFA models can be very large. Each variable carries little chemical information, and interpretation of the QSAR/QSPR model can be difficult.
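A short arithmetic sketch shows how quickly the number of lattice variables grows as the sampling is refined. The box size and the spacings below are hypothetical values chosen for illustration, not settings taken from any CoMFA study.

```python
import math

def n_lattice_points(box_lengths, spacing):
    """Number of lattice points in a rectangular box sampled at `spacing`."""
    return math.prod(int(length // spacing) + 1 for length in box_lengths)

box = (20.0, 20.0, 20.0)  # hypothetical box edge lengths, in angstroms
for spacing in (2.0, 1.0, 0.5):
    print(spacing, n_lattice_points(box, spacing))
```

Halving the spacing multiplies the variable count by roughly eight, so a dense lattice easily produces tens of thousands of variables per molecule.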
What we want from a structure representation is a set of variables that contain a maximum of chemical information without making any compromises with respect to quantum mechanics. A promising candidate for such a representation appears at first glance to be the 3D electron density distribution ρ(r) [21], [22]. It is firmly rooted in quantum mechanics, and most chemists feel comfortable with it. However, representing this 3D function as sampled lattice points would cause the same problems as faced with CoMFA and related methods [23], [24].
A solution to this problem would be to find a highly compressed representation of the electronic density that can also be related to intuitive chemical concepts. Fortunately, there is a theory that provides such a representation: the quantum topological atoms in molecules (AIM) theory [25], [26], [27], [28], pioneered by Professor Richard F.W. Bader at McMaster University in Canada. As will be shown in this article, AIM theory enables us to construct an electronic representation which (a) uses a small number of variables to encode the electronic structure, (b) contains significant chemical information for each variable, (c) is directly connected to quantum mechanics and (d) is easy to interpret.
Compression of the 3D electron density distribution
AIM theory makes a link between quantum mechanics and standard chemical concepts such as an atom and a chemical bond. There is no explicit concept of an atom or a bond in the Schrödinger equation. It is only concerned with particles (electrons and nuclei) in potential fields. However, much of standard chemical knowledge is based on the atomic model. So, on one hand, we want to make use of rigorous and physically correct quantum mechanics in our calculations, and on the other hand, we
The AIM descriptor matrix
Here, a scheme for comparing molecules that is specially designed to work for attribute-based data analytical methods such as PLS is outlined. Of course, our StruQT approach can also be used with non-attribute-based machine learning methods such as inductive logic programming (ILP) [1], [2].
In our structure representation, each molecule is contained in a matrix (here referred to as the AIM descriptor matrix, ADM), where each row contains the information about a single critical point. Our
Selection of data set
In order to test out the StruQT approach, we have decided to use a data set where comparison between the different molecules is straightforward. It was also desirable to have a data set where there is some quantum theoretical understanding of the connection between the structure and the property/activity studied. Based on this, we chose a data set where the aim is to predict the wavelength λmax of the UV absorption maximum of the first π–π* excitation.
The set of compounds studied are
PLS analysis using all variables
Before performing any regression analysis, three compounds were set aside as an independent validation set. These compounds are:
- 6′-Hydroxyflavylium, no. 1.
- 4′-Dihydroxyflavylium, no. 3.
- 5,7,3′,4′,5′-Tetrahydroxyflavylium, no. 17.
These compounds were selected as they spanned the variation in the data. The numbering of the compounds corresponds to the one used in Table 2.
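The hold-out step can be sketched as follows. Only the validation numbers 1, 3 and 17 come from the text; the function name and the total of 20 compounds are placeholders assumed for illustration, and the actual data set size should be taken from Table 2.

```python
def split_holdout(compound_ids, validation_ids):
    """Split compound numbers into calibration and independent validation sets."""
    missing = set(validation_ids) - set(compound_ids)
    if missing:
        raise ValueError(f"validation compounds not in data set: {sorted(missing)}")
    validation = [c for c in compound_ids if c in validation_ids]
    calibration = [c for c in compound_ids if c not in validation_ids]
    return calibration, validation

# Placeholder numbering: the real number of compounds is an assumption here.
compound_ids = list(range(1, 21))
validation_ids = {1, 3, 17}  # compound numbers given in the text

calibration, validation = split_holdout(compound_ids, validation_ids)
print(validation)
print(len(calibration))
```

Keeping the validation compounds out of the data before any model fitting means that cross-validation and factor selection only ever see the calibration set.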
For this PLS analysis, the routine crossval in the PLS_toolbox [56] was used. The maximum number of PLS factors tested for
Conclusion
We have demonstrated that the StruQT approach can be used to successfully form QSPR models that are easy to interpret and have high predictive ability.
Inspection of our anthocyanidin QSPR models made it possible to identify the importance of the π system through the ellipticity parameter and more importantly, to identify which bonds are involved in either decreasing or increasing the value of λmax.
Acknowledgements
NG wishes to thank the BBSRC for financial support (grant no. 2/B11471). We express our gratitude to Dr. Knut J. Børve at the University of Bergen, Norway for kindly providing the optimized anthocyanidin structures and commenting on the manuscript. Dr. Richard Gilbert is thanked for providing the GP program used.
References (62)
- et al. A comparative molecular field analysis and molecular modelling studies on pyridylimidazole type of angiotensin II antagonists. Bioorg. Med. Chem. (1999)
- et al. Synthesis, antiplatelet activity and comparative molecular field analysis of substituted 2-amino-4H-pyrido[1,2-a]pyrimidin-4-ones, their congeners and isosteric analogues. Bioorg. Med. Chem. (2000)
- et al. Shift and intensity modeling in spectroscopy—general concept and applications. Chemom. Intell. Lab. Syst. (1999)
- et al. Improving the interpretation of multivariate and rule induction models by using a peak parameter representation. Chemom. Intell. Lab. Syst. (1997)
- et al. Modeling and prediction of molecular properties: theory of grid-weighted holistic invariant molecular (G-WHIM) descriptors. Chemom. Intell. Lab. Syst. (1997)
- Molecular reference (MOLREF), a new method in quantitative structure–activity relationships (QSAR). Chemom. Intell. Lab. Syst. (1990)
- et al. Computational study of the electronic excitations of some anthocyanidins. Spectrochim. Acta, Part A (1998)
- et al. Colour and stability of the six common anthocyanidin 3-glucosides in aqueous solutions. Food Chem. (2000)
- et al. Color and substitution pattern in anthocyanidins. A combined quantum chemical–chemometrical study. Spectrochim. Acta, Part A (1999)
- MORPHY, a program for an automated atoms in molecules analysis. Comput. Phys. Commun. (1996)