Developing Non-linear Rate Constant QSPR using Decision Trees and Multi-Gene Genetic Programming

https://doi.org/10.1016/B978-0-444-64241-7.50407-9

Abstract

Developing a QSPR model that captures not only the influence of reactant structures but also the solvent effect on the reaction rate is of significance. Such QSPR models will serve as a prerequisite for the simultaneous computer-aided molecular design (CAMD) of reactants, products and solvents. They will also be useful in predicting the rate constant without relying entirely on experiments. To develop such a QSPR, Datta et al. (2017) recently used the Diels-Alder reaction as a case study. Their model displayed great promise, but there is scope for improvement in its predictive ability. In our work, we improve upon their model by introducing non-linearity through multi-gene genetic programming (MGGP). In our methodology, a combination of genetic algorithm (GA) and directed trees was used to develop a branched version of chromosomes, allowing additional possibilities in the generated models. Prior to model development through MGGP, principal component analysis (PCA) was conducted. Lastly, models were evaluated based on metrics such as R², Q², and RMSE.

Introduction

The search for QSPR models that predict the influence of the structures of reactants and solvents on the reaction rate constant has long been a challenge. According to Roy et al. (2015), QSPR (Quantitative Structure-Property Relationship) models are generally linear or non-linear mathematical relationships that correlate a particular property or activity of a chemical species with its structure. Such structures are generally represented numerically by descriptors, which can be determined experimentally or theoretically, depending on their definition.

Early attempts to develop QSPR models for predicting the rate constant of a reaction have been limited in scope. Either the rate constant was studied as a function of the reactants’ structures while the solvent was kept fixed, or the solvent structures were varied while the reactants’ structures were kept constant. With regard to the influence of reactants’ structures, Chaudry and Popelier (2003) developed a property model to predict the rate constant of ester hydrolysis using quantum chemical descriptors. Estrada and Matamala (2007) used generalized topological indices to predict the gas-phase reaction rate constants of alkanes and cycloalkanes with OH, Cl and NO3 radicals. Recently, Struebing et al. (2013) developed a methodology to design solvents by utilizing surrogate models and quantum chemical calculations.

There is a need for QSPR models that capture the combined influence of reactants’ structures and solvent structures. Such models will serve two purposes: the first is to quickly predict the rate constant without relying fully on experiments, and the second is the simultaneous design of reactants, products and solvents. With respect to QSPRs that capture both reactant and solvent influence, Nandi et al. (2013) developed a quantitative structure-activation barrier relationship for the Diels-Alder reaction that utilizes quantum chemical descriptors. Their aim was to construct a relationship between the activation energy and the structures of the reactants and solvent. However, their data set lacked solvent variety. Recently, Zhou et al. (2014) studied a variety of solvents for the Diels-Alder reaction in their search for new solvent descriptors, though they used only one set of reactants. Thus, we have combined the data sets utilized by Nandi et al. (2013) and Zhou et al. (2014) to create a set that offers more diversity in terms of solvents. We have used this more diverse data set to develop a rate constant model in terms of connectivity indices. It is worth noting that Nandi et al. (2013) relied on the data set utilized in the work of Tang et al. (2012).

In our previous work, a hybrid GA-DT algorithm was designed to develop a linear model (Datta et al., 2017). In this work, we propose an efficient multi-gene genetic programming (MGGP) algorithm for model development that is initialized by a modified DT algorithm and utilizes the “divide and conquer” strategy in combination with principal component analysis (PCA). This DT algorithm checks whether the addition of a branched gene improves model fitness. MGGP algorithms hold promise because they can develop models using a wide variety of nonlinear mathematical basis functions. Internal and external validations were performed separately to determine model confidence. Additionally, the RMSE and R² values of the model were calculated for both external and internal validation, as they describe how well the model fits the data.
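The multi-gene model structure and the validation metrics mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the data are synthetic, the two gene functions are hypothetical, and the names pca_scores, gene_matrix, fit_weights and q2_loo are invented for illustration. The sketch assumes the usual multi-gene form, in which the model output is a bias term plus a least-squares weighted sum of nonlinear gene outputs, and it assumes a leave-one-out definition of Q².

```python
import numpy as np

def pca_scores(X, n_components):
    """Project mean-centered descriptors onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def gene_matrix(Z, genes):
    """Evaluate each gene on the inputs and prepend a bias column of ones."""
    return np.column_stack([np.ones(len(Z))] + [g(Z) for g in genes])

def fit_weights(G, y):
    """Least-squares weights for the linear combination of gene outputs."""
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    return w

def r2(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def q2_loo(G, y):
    """Leave-one-out Q2: refit the weights without sample i, then predict it."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        preds[i] = G[i] @ fit_weights(G[keep], y[keep])
    return r2(y, preds)

# Synthetic stand-in for a descriptor table (assumption: 72 samples, 6 descriptors).
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 6))
Z = pca_scores(X, 3)

# Two hypothetical genes; a real MGGP run would evolve these expressions.
genes = [lambda Z: np.tanh(Z[:, 0]), lambda Z: Z[:, 1] * Z[:, 2]]
G = gene_matrix(Z, genes)

# Synthetic response built from the same genes plus a little noise.
y = 1.5 + 2.0 * G[:, 1] - 0.7 * G[:, 2] + rng.normal(scale=0.05, size=72)

w = fit_weights(G, y)
y_hat = G @ w
print("weights:", w)
print("R2:", r2(y, y_hat), "RMSE:", rmse(y, y_hat), "Q2 (LOO):", q2_loo(G, y))
```

Because only the gene weights are refit in each leave-one-out round, Q² computed this way measures the stability of the linear combination rather than of the evolved gene expressions themselves; a stricter check would rerun the whole search for every left-out sample.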

The Diels-Alder reaction is a well-studied organic chemical reaction involving a conjugated diene and an alkene, the latter also termed the dienophile. Evans and Johnson (1999) have discussed this reaction in detail in their work. The reaction involves the cycloaddition of two reactants in the presence of a solvent. Earlier, Rideout and Breslow (1980) discussed the hydrophobic acceleration of the Diels-Alder reaction, specifically the influence of hydrophobic cavities in organic structures on the acceleration of the reaction rate. In both of the aforementioned works, the impact of the solvent on the rate constant of the reaction was observed. This feature makes the reaction an appealing choice for the present study.

Methodology

From the work of Tang et al. (2012) and Zhou et al. (2014, 2015), we generated a diverse data set of 72 reactions that consisted of 38 different dienophiles, 19 dienes and 10 solvents. All chemical species were drawn using the Avogadro software. The structures were optimized using MMFF94s, a force field built into Avogadro for geometry optimization, as suggested by Datta et al. (2015). The optimized geometries were saved as MOL files. These files were then used as input for

Results

As GPTIPS 2.0 allows the number and depth of the genes to be varied, the objective was to develop a better model than that presented by Datta et al. (2017) using a nonlinear model that contained the minimum number of descriptors possible. Although only connectivity index descriptors were used in this effort, it is beneficial to develop a model with fewer descriptors to increase the interpretability of the model. The population size and number of iterations were kept constant at

Conclusions

In this project, the aim was to develop a better model than that previously proposed for the same property using QSPR analysis (Datta et al., 2017). From the work of Datta et al. (2017), it was clear that the hybrid GA-DT method was very efficient in developing linear models of this type. Dev et al. (2017) also concluded that the hybrid GA-DT method provided the best possible model; thus, developing a nonlinear model was the only option left to develop a better property model. From the results
