QSAR study of anti-HIV HEPT analogues based on multi-objective genetic programming and counter-propagation neural network
Introduction
In quantitative structure–activity relationship (QSAR) studies, chemical structures are represented by calculated descriptors such as topological, geometric, electrostatic, quantum chemical, and thermodynamic descriptors. These descriptors are then used to construct a statistical model relating chemical structures to their biological activities or chemical properties. Because many groups have proposed methods for representing chemical structures, many kinds of chemical descriptors are available. On the other hand, in many cases a sufficient number of chemical structures with measured activities cannot be obtained, because biological experiments are relatively expensive. In this situation, with many descriptors and few compounds, we encounter the over-fitting problem and need to select important variables to avoid it. An appropriate variable selection procedure yields a more stable and predictive QSAR model. Moreover, if the model contains only a small number of significant variables, it can be expected to be easier to interpret. Therefore, to overcome the over-fitting problem, various methods for variable selection have been proposed, e.g. [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. The optimization methodologies used in these studies are the genetic algorithm (GA), genetic programming (GP), simulated annealing (SA), and the ant colony optimization (ACO) algorithm.
For example, we have proposed a variable selection method called GA-based PLS (GAPLS) [12]. In this method, the value of Q2 obtained from partial least squares (PLS) regression is used as the evaluation criterion for each individual. As a result of the GA search, various candidate models with different numbers of variables are obtained. Scientists can then select the best model from among them based on the balance between the number of variables and the values of R2 and Q2.
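The use of Q2 as a fitness value for a variable subset can be sketched as follows. This is only an illustration under stated assumptions: ordinary least squares stands in for PLS to keep the example dependency-free, Q2 is taken as 1 - PRESS/SS from leave-one-out cross-validation, and the function names (`solve`, `ols_fit`, `ols_predict`, `q2_loo`) and the toy data are hypothetical, not taken from GAPLS [12].

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations.

    Assumption: OLS is used here as a simple stand-in for PLS regression.
    """
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(A, b)


def ols_predict(coef, row):
    """Predict one sample from the fitted coefficients."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def q2_loo(X, y, mask):
    """Leave-one-out Q2 = 1 - PRESS / SS for the descriptors switched on in mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    Xs = [[row[j] for j in cols] for row in X]
    y_mean = sum(y) / len(y)
    ss = sum((yi - y_mean) ** 2 for yi in y)
    press = 0.0
    for i in range(len(y)):
        # Refit without sample i, then predict the held-out sample.
        coef = ols_fit(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - ols_predict(coef, Xs[i])) ** 2
    return 1.0 - press / ss


# Toy data (hypothetical): y depends only on the first descriptor.
X = [[0, 3], [1, 1], [2, 4], [3, 1], [4, 5], [5, 9]]
y = [1, 3, 5, 7, 9, 11]
print(q2_loo(X, y, [1, 0]), q2_loo(X, y, [0, 1]))
```

In a GA search, each individual would be a bit mask like `[1, 0]`, and a value such as `q2_loo(X, y, mask)` would serve as its evaluation criterion.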
The main purpose of variable selection is to reduce the number of variables in a statistical model while keeping the accuracy of the model at a high level. Needless to say, the predictive power and interpretability of the model are also important factors. These factors therefore have to be satisfied simultaneously in variable selection and subsequent statistical modeling. In many cases, however, this is quite difficult because these factors conflict with each other. This type of problem is known as multi-objective optimization; various studies have been carried out in the field of chemistry [13], [14], [15], and the area has been surveyed by Coello [16].
One of the most promising techniques for resolving this conflict is to use the concept of Pareto optimality. Recently, Nicolotti et al. [17] proposed a method of variable selection using GP and multi-objective optimization. In their study, they successfully constructed QSAR models of the Selwood data set and two solubility data sets.
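The idea of Pareto optimality can be illustrated with a short sketch. It assumes two objectives that are both minimized, the number of variables and 1 - Q2; the candidate values are hypothetical and the function names are chosen here for illustration, not taken from Nicolotti et al. [17].

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the candidates that no other candidate dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical candidate models: (number of variables, 1 - Q2).
candidates = [(3, 0.40), (5, 0.30), (5, 0.35), (11, 0.20), (12, 0.25)]
front = pareto_front(candidates)
# (5, 0.35) is dominated by (5, 0.30), and (12, 0.25) by (11, 0.20).
print(front)
```

The models left on the front represent different trade-offs between size and predictivity; none of them can be improved on one objective without getting worse on the other.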
In this study, we propose a novel method for variable selection that extends Nicolotti's idea. The method was applied to HIV-1 reverse transcriptase inhibitors to construct a QSAR model. Thirty-four descriptors were calculated for the HEPT derivatives, and 11 variables were then selected by the proposed variable selection method. Then, to build a more predictive nonlinear QSAR model, a counter-propagation (CP) neural network was trained using the 11 selected variables.
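For readers unfamiliar with counter-propagation networks, the general principle can be sketched as below. This is a deliberately reduced illustration, not the network configuration used in this study: it updates only the winning neuron (no two-dimensional map or neighborhood function), initializes neurons from evenly spaced training samples, and uses a linearly decaying learning rate. All function names, parameters, and data are assumptions made for the example.

```python
def winner(weights, x):
    """Index of the neuron whose input weights are closest to x."""
    return min(range(len(weights)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], x)))


def train_cpnn(X, y, n_neurons=2, epochs=50, lr=0.3):
    """Train a minimal counter-propagation network.

    Each neuron holds input weights (Kohonen layer) and one output weight
    (output layer). Assumption: only the winning neuron is updated.
    """
    # Initialize neurons from evenly spaced training samples (an assumption).
    picks = [i * len(X) // n_neurons for i in range(n_neurons)]
    w_in = [list(X[i]) for i in picks]
    w_out = [float(y[i]) for i in picks]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, target in zip(X, y):
            k = winner(w_in, x)
            # Move the winner's input weights toward the sample ...
            w_in[k] = [w + rate * (v - w) for w, v in zip(w_in[k], x)]
            # ... and its output weight toward the sample's activity.
            w_out[k] += rate * (target - w_out[k])
    return w_in, w_out


def predict_cpnn(model, x):
    """Predict by returning the output weight of the winning neuron."""
    w_in, w_out = model
    return w_out[winner(w_in, x)]


# Hypothetical data: two well-separated clusters with different activities.
X = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
y = [1, 1, 1, 3, 3, 3]
model = train_cpnn(X, y)
print(predict_cpnn(model, [0.05, 0.05]), predict_cpnn(model, [4.9, 5.0]))
```

With the 11 selected variables, each input vector would have 11 components instead of the two used in this toy example.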
Section snippets
Structure activity data
A data set of 77 HEPT derivatives with inhibitory activities against HIV-1 reverse transcriptase was used. HEPT is a non-nucleoside reverse transcriptase inhibitor (NNRTI) with potent anti-HIV-1 activity, and various QSAR studies have been published [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The chemical structures and inhibitory activities used in this study are shown in Fig. 1 and Table 1. The activity values pIC50, the negative logarithm of the molar concentration required to
Results and discussion
First, we constructed a PLS model using all 34 variables. The model had two latent variables, with R2 = 0.565 and Q2 = 0.405. This result was not acceptable because the values of R2 and Q2 are very low. We then added the square term of every descriptor to construct a nonlinear QSAR model. The resulting PLS model using 68 variables gave R2 = 0.718 and Q2 = 0.431 with five components. Although slight improvement was observed, it can be said that accuracy and predictive power of the model are
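The square-term expansion mentioned in this snippet, which turns 34 descriptors into 68 variables, can be sketched as follows. The function name and the placeholder descriptor values are hypothetical; only the column counts come from the text.

```python
def add_square_terms(X):
    """Append the square of every descriptor to each row of X."""
    return [list(row) + [v * v for v in row] for row in X]


# One hypothetical compound described by 34 placeholder descriptor values.
X = [[float(i) for i in range(34)]]
X_expanded = add_square_terms(X)
print(len(X[0]), len(X_expanded[0]))
```

A linear model fitted to the expanded matrix can capture quadratic dependence on each descriptor, at the cost of doubling the number of variables.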
Conclusion
In this paper, we described a variable selection method using GP and multi-objective optimization. The QSAR model could be optimized with respect to accuracy, predictivity, and interpretability by using the concept of Pareto optimality. To confirm the performance of our variable selection method, a QSAR analysis of HIV-1 reverse transcriptase inhibitors was carried out. Thirty-four structural descriptors of the HEPT derivatives were calculated, and 11 variables were then selected using the
References (37)
et al., Bioorg. Med. Chem. Lett. (1993)
Chemom. Intell. Lab. Syst., Lab. Inf. Manag. (2001)
et al., Chemom. Intell. Lab. Syst., Lab. Inf. Manag. (1998)
et al., Bioorg. Med. Chem. Lett. (1992)
et al., Eur. J. Med. Chem. (1994)
et al., Bioorg. Med. Chem. (2004)
Quant. Struct.-Act. Relat. Pharmacol. Chem. Biol. (1994)
Quant. Struct.-Act. Relat. Pharmacol. Chem. Biol. (1994)
et al., J. Med. Chem. (1996)
et al., J. Med. Chem. (1996)
J. Chem. Inf. Comput. Sci.
J. Chem. Inf. Comput. Sci.
J. Chem. Inf. Comput. Sci.
J. Chem. Inf. Comput. Sci.
J. Chem. Inf. Comput. Sci.
Ind. Eng. Chem. Res.
Ind. Eng. Chem. Res.
Ind. Eng. Chem. Res.
Cited by (25)
Optimal policies for control of the novel coronavirus disease (COVID-19) outbreak
2020, Chaos, Solitons and Fractals
Citation excerpt: After a number of generations, good traits dominate the population and augment the quality of solutions. Due to genetic algorithm's fascinating features and its strong convergence, up to now, researchers have developed this algorithm and used that to address a wide variety of problems in different fields of study [19,20]. A multi-objective approach is used to find optimal decision rules as the Pareto frontier.
Counter propagation auto-associative neural network based data imputation
2015, Information Sciences
Citation excerpt: This neural network architecture has successfully found applications in digital image copyright authentication [16], data compression, approximation, classification tasks, etc. Furthermore, it is most often used by the chemometric community [3,4,6,7,8,12,65,73,96,97]. Since the CPNN and the CPAANN differ only in the output layer, for the sake of brevity, the training algorithm of CPNN is not provided here.
Fast optimization of hyperparameters for support vector regression models with highly predictive ability
2015, Chemometrics and Intelligent Laboratory Systems
Citation excerpt: However, when relationships between X and y are nonlinear, linear regression models cannot represent these relationships. Nonlinear regression methods such as back-propagation neural network [5], counter-propagation neural network [6], kernel PLS [7], Gaussian process [8] and support vector regression (SVR) [9] are required. In this study, we focus on SVR because of its theoretical background and the Gaussian kernel is usually used in SVR modeling.
QSAR models for HEPT derivates as NNRTI inhibitors based on Monte Carlo method
2014, European Journal of Medicinal Chemistry
Citation excerpt: The importance of quantitative structure–activity relationship (QSAR) methods in modern drug design is well established since QSAR can make the early prediction of activity-related characteristics of drug candidates and can eliminate molecules with undesired properties [15]. A number of QSAR studies have been reported for HEPT compounds [16–28]. Thousands of molecular descriptors used in QSAR studies have been defined to encode chemical and structural features of molecules [29,30].
Tailored scoring function of Trypsin-benzamidine complex using COMBINE descriptors and support vector regression
2008, Chemometrics and Intelligent Laboratory Systems
Effect of an antiviral drug control and its variable order fractional network in host COVID-19 kinetics
2022, European Physical Journal: Special Topics