QSAR study of anti-HIV HEPT analogues based on multi-objective genetic programming and counter-propagation neural network

https://doi.org/10.1016/j.chemolab.2006.01.009

Abstract

A quantitative structure–activity relationship (QSAR) model has been developed for a set of inhibitors of human immunodeficiency virus 1 (HIV-1) reverse transcriptase, derivatives of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT). The structural descriptors used in this study are Hansch constants for each substituent and topological descriptors. We applied a variable selection method based on multi-objective genetic programming (GP) to the HEPT data and constructed a nonlinear QSAR model from the selected variables using a counter-propagation (CP) neural network. The resulting network is accurate and interpretable. Moreover, a validation test was performed to confirm the predictive ability of the model.

Introduction

In quantitative structure–activity relationship (QSAR) studies, chemical structures are represented by calculated descriptors such as topological, geometric, electrostatic, quantum chemical and thermodynamic descriptors. These descriptors are then used to construct a statistical model relating chemical structures to their biological activities or chemical properties. Because many groups have proposed methods for representing chemical structures, a large variety of chemical descriptors is available. In many cases, however, only a limited number of chemical structures with measured activities can be obtained, because biological experiments are relatively expensive. With many descriptors and few compounds, we encounter the over-fitting problem and need to select the important variables to avoid it. An appropriate variable selection procedure yields a more stable and predictive QSAR model. Moreover, a model that contains only a small number of significant variables can be expected to be easier to interpret. To overcome the over-fitting problem, various methods for variable selection have therefore been proposed, e.g. [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. The optimization methodologies used in these studies are the genetic algorithm (GA), genetic programming (GP), simulated annealing (SA) and the ant colony optimization (ACO) algorithm.

For example, we have proposed a variable selection method called GA-based PLS (GAPLS) [12]. In this method, the value of Q2 obtained from partial least squares (PLS) regression is used as the evaluation criterion for each individual. The GA search thus yields a range of candidate models with different numbers of variables. Scientists can select the best of these models by balancing the number of variables against the values of R2 and Q2.
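The two statistics used to compare models here can be illustrated with a short sketch. It assumes the common definitions R2 = 1 - SS_res/SS_tot and a leave-one-out Q2 = 1 - PRESS/SS_tot, with SS_tot taken around the mean of all activities; the source does not state which convention it uses. A one-descriptor least-squares line stands in for PLS to keep the example dependency-free, and the function names and data values are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares (a stand-in for PLS)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def r2(xs, ys, fit=fit_line):
    """Goodness of fit: the model is fitted and evaluated on the same data."""
    model = fit(xs, ys)
    my = mean(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def q2_loo(xs, ys, fit=fit_line):
    """Leave-one-out Q2: each compound is predicted by a model fitted without it."""
    my = mean(ys)
    press = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - model(xs[i])) ** 2
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Made-up descriptor values and activities, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.1, 1.9, 3.2, 3.9, 5.1]
print(f"R2 = {r2(descriptor, activity):.3f}, Q2 = {q2_loo(descriptor, activity):.3f}")
```

For a least-squares fit, Q2 can never exceed R2, which is why a large gap between the two is usually read as a sign of over-fitting.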

The main purpose of variable selection is to reduce the number of variables in a statistical model while keeping the accuracy of the model at a high level. Needless to say, the predictive power and interpretability of the model are also important. All of these factors therefore have to be satisfied simultaneously in variable selection and the subsequent statistical modeling. In many cases, however, this is quite difficult because the requirements conflict with one another. This type of problem is known as multi-objective optimization; various studies have been carried out in the field of chemistry [13], [14], [15], and the area has been surveyed by Coello [16].

One of the most promising techniques for resolving this conflict is the concept of Pareto optimality. Recently, Nicolotti et al. [17] proposed a method of variable selection using GP and multi-objective optimization. In their study, they successfully constructed QSAR models of the Selwood data set and two other solubility data sets.
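To make the idea of Pareto optimality concrete, the sketch below filters a list of candidate models down to the non-dominated ones. It assumes just two objectives, the number of variables (to be minimized) and Q2 (to be maximized). The objectives actually used by Nicolotti et al. or in this study may differ, and the candidate values are invented for illustration.

```python
def dominates(a, b):
    """Return True if model a dominates model b.

    Each model is a tuple (n_variables, q2). Fewer variables and a higher
    Q2 are better. a dominates b if it is no worse in both objectives and
    strictly better in at least one.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(other, m) for other in models)]


# Hypothetical candidates: (number of variables, Q2). Not values from the paper.
candidates = [(3, 0.55), (5, 0.70), (5, 0.62), (8, 0.71), (11, 0.80), (12, 0.78)]
front = pareto_front(candidates)
print(front)
```

In this toy set, (5, 0.62) is dominated by (5, 0.70) and (12, 0.78) by (11, 0.80). The remaining models form the trade-off curve from which a modeler would choose.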

In this study, we propose a novel method for variable selection that extends Nicolotti's idea. The method was applied to HIV-1 reverse transcriptase inhibitors to construct a QSAR model. A total of 34 descriptors were calculated for the HEPT derivatives, and 11 variables were then selected by the proposed variable selection method. Finally, to build a more predictive nonlinear QSAR model, a counter-propagation (CP) neural network was trained using the 11 selected variables.
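As a rough illustration of how a CP neural network maps descriptors to an activity, here is a minimal sketch of the general technique, not the network configuration used in this study. Each neuron holds a weight vector in descriptor space (the Kohonen layer) and one activity value (the output layer). Several simplifications are assumed: only the winning neuron is updated, whereas a full CP network also updates its map neighbours; weights are initialised from evenly spaced training samples; and the learning rate decays linearly. The class name, parameters and toy data are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


class CounterPropagation:
    """Simplified CP network: winner-only updates, no map neighbourhood."""

    def __init__(self, n_neurons):
        self.n_neurons = n_neurons
        self.kohonen = []  # one descriptor-space weight vector per neuron
        self.output = []   # one activity value per neuron

    def winner(self, x):
        """Index of the neuron whose Kohonen weights are closest to x."""
        return min(range(self.n_neurons),
                   key=lambda j: sq_dist(x, self.kohonen[j]))

    def fit(self, X, y, epochs=20, eta_start=0.5, eta_end=0.01):
        # Initialise Kohonen weights from evenly spaced training samples
        # (requires n_neurons <= len(X)); output weights start at zero.
        step = len(X) / self.n_neurons
        self.kohonen = [list(X[int(j * step)]) for j in range(self.n_neurons)]
        self.output = [0.0] * self.n_neurons
        for epoch in range(epochs):
            # Learning rate decays linearly from eta_start to eta_end.
            frac = epoch / max(epochs - 1, 1)
            eta = eta_start + (eta_end - eta_start) * frac
            for x, target in zip(X, y):
                c = self.winner(x)
                # Move the winner toward the sample in descriptor space ...
                self.kohonen[c] = [w + eta * (xi - w)
                                   for w, xi in zip(self.kohonen[c], x)]
                # ... and move its output weight toward the sample's activity.
                self.output[c] += eta * (target - self.output[c])
        return self

    def predict(self, x):
        """Predicted activity is the output weight of the winning neuron."""
        return self.output[self.winner(x)]


# Toy data: two descriptors per compound, two clusters with different activity.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
y = [5.0, 5.0, 5.0, 8.0, 8.0, 8.0]
net = CounterPropagation(n_neurons=2).fit(X, y)
print(net.predict((0.05, 0.05)), net.predict((0.95, 0.95)))
```

Because every prediction is read from a single neuron, the trained map can be inspected neuron by neuron, which is one reason such networks are regarded as interpretable.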

Structure activity data

A set of 77 HEPT derivatives with inhibitory activity against HIV-1 reverse transcriptase was used as the data set. HEPT is a non-nucleoside reverse transcriptase inhibitor (NNRTI) with potent anti-HIV-1 activity, and various QSAR studies of it have been published [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The chemical structures and inhibitory activities used in this study are shown in Fig. 1 and Table 1. The activity values, pIC50, the negative logarithm of the molar concentration required to

Results and discussion

First, we constructed a PLS model using all 34 variables. The model has two latent variables, with R2 = 0.565 and Q2 = 0.405. This result was not acceptable because both values are very low. We then added the squared term of every descriptor to construct a nonlinear QSAR model. The resulting PLS model with 68 variables gave R2 = 0.718 and Q2 = 0.431 with five components. Although a slight improvement was observed, it can be said that accuracy and predictive power of the model are

Conclusion

In this paper, we described a variable selection method using GP and multi-objective optimization. The QSAR model could be optimized with respect to accuracy, predictivity and interpretability by using the concept of the Pareto optimum. To confirm the performance of our variable selection method, a QSAR analysis of HIV-1 reverse transcriptase inhibitors was carried out. The 34 structural descriptors of the HEPT derivatives were calculated and then 11 variables were selected using the

References (37)

  • J. Wikel et al., Bioorg. Med. Chem. Lett. (1993)
  • A. Hoskuldsson, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. (2001)
  • R. Leardi et al., Chemom. Intell. Lab. Syst., Lab. Inf. Manag. (1998)
  • C. Hansch et al., Bioorg. Med. Chem. Lett. (1992)
  • A. Seri-Levy et al., Eur. J. Med. Chem. (1994)
  • S. Gayen et al., Bioorg. Med. Chem. (2004)
  • H. Kubinyi, Quant. Struct.-Act. Relat. Pharmacol. Chem. Biol. (1994)
  • H. Kubinyi, Quant. Struct.-Act. Relat. Pharmacol. Chem. Biol. (1994)
  • S.S. So et al., J. Med. Chem. (1996)
  • S.S. So et al., J. Med. Chem. (1996)
  • K. Hasegawa et al., J. Chem. Inf. Comput. Sci. (1999)
  • A. Yasri et al., J. Chem. Inf. Comput. Sci. (2001)
  • S.J. Cho et al., J. Chem. Inf. Comput. Sci. (2002)
  • S.S. Liu et al., J. Chem. Inf. Comput. Sci. (2003)
  • K. Hasegawa et al., J. Chem. Inf. Comput. Sci. (1997)
  • L. Elliott et al., Ind. Eng. Chem. Res. (2003)
  • A. Tarafder et al., Ind. Eng. Chem. Res. (2005)
  • K. Mitra et al., Ind. Eng. Chem. Res. (2004)