Developing Non-linear Rate Constant QSPR using Decision Trees and Multi-Gene Genetic Programming
Introduction
The search for QSPR models that predict the influence of reactant and solvent structures on the reaction rate constant has long been a challenge. According to Roy et al. (2015), QSPR (Quantitative Structure Property Relationship) models are generally linear or non-linear mathematical relationships that correlate a particular property or activity of a chemical species with its structure. Such structures are generally represented numerically by descriptors, which can be determined experimentally or theoretically, depending on their definition. Early attempts to develop QSPR models for predicting the rate constant of a reaction have been limited in scope: either the rate constant was studied as a function of the reactants' structures while the solvent was kept constant, or the solvent structures were varied while the reactants' structures were kept constant. With regard to the influence of reactants' structures, Chaudry and Popelier (2003) developed a property model to predict the rate constant of ester hydrolysis using quantum chemical descriptors. Estrada and Matamala (2007) used generalized topological indices to predict the gas-phase reaction rate constants of alkanes and cycloalkanes with OH, Cl and NO3 radicals. More recently, Struebing et al. (2013) developed a methodology to design solvents using surrogate models and quantum chemical calculations.
There is a need for QSPR models that capture the combined influence of reactants' structures and solvent structures. Such models would serve two purposes: first, to predict the rate constant quickly without relying fully on experiments, and second, to enable the simultaneous design of reactants, products and solvents. With respect to QSPRs that capture both reactant and solvent influence, Nandi et al. (2013) developed a quantitative structure-activation barrier relationship for the Diels-Alder reaction that utilizes quantum chemical descriptors. Their aim was to construct a relationship between the activation energy and the structures of the reactants and solvent. However, their data set lacked solvent variety. Zhou et al. (2014) studied a variety of solvents for the Diels-Alder reaction in their search for new solvent descriptors, though they used only one set of reactants. We have therefore combined the data sets used by Nandi et al. (2013) and Zhou et al. (2014) to create a set that offers more solvent diversity, and we have used this more diverse data set to develop a rate constant model in terms of connectivity indices. It is worth noting that Nandi et al. (2013) relied on the data set used in the work of Tang et al. (2012).
In our previous work, a hybrid genetic algorithm-decision tree (GA-DT) algorithm was designed to develop a linear model (Datta et al., 2017). In this work, we propose an efficient Multi-Gene Genetic Programming (MGGP) algorithm for model development that is initialized by a modified DT algorithm, which utilizes the “divide and conquer” strategy in combination with Principal Component Analysis. This DT algorithm checks whether adding a branched gene improves model fitness. MGGP algorithms hold promise because they can develop models from a wide variety of nonlinear mathematical basis functions. Internal and external validations were performed separately to determine model confidence. Additionally, the model RMSE and R² values were calculated for both internal and external validation, as they describe how well the model fits the data.
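To make the multi-gene idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common multi-gene formulation in which the model output is a bias term plus a weighted sum of gene outputs, with the weights obtained by ordinary least squares, and it then computes RMSE and R² for the fit. The three genes, the two-descriptor inputs, and the synthetic targets are all hypothetical and chosen only for illustration; the evolutionary search over gene structures is omitted entirely.

```python
import math

# Hypothetical genes: each maps a descriptor vector x to one nonlinear basis term.
# These forms are illustrative only; they are not the genes reported in the paper.
GENES = [
    lambda x: x[0] * x[1],
    lambda x: math.sqrt(abs(x[0])),
    lambda x: x[1] ** 2,
]

def design_matrix(X, genes):
    """Build rows of [1, g1(x), g2(x), ...]; the leading 1 is the bias term."""
    return [[1.0] + [g(x) for g in genes] for x in X]

def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_gene_weights(X, y, genes):
    """Ordinary least squares via the normal equations (G^T G) w = G^T y."""
    G = design_matrix(X, genes)
    k = len(G[0])
    GtG = [[sum(row[i] * row[j] for row in G) for j in range(k)] for i in range(k)]
    Gty = [sum(row[i] * yi for row, yi in zip(G, y)) for i in range(k)]
    return solve_linear(GtG, Gty)

def predict(X, w, genes):
    """Evaluate the weighted sum of bias and gene outputs for each sample."""
    return [sum(wi * gi for wi, gi in zip(w, row)) for row in design_matrix(X, genes)]

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Synthetic two-descriptor data generated from known weights (assumption, not real data).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
TRUE_W = [0.5, 2.0, -1.0, 0.25]
y = predict(X, TRUE_W, GENES)

w = fit_gene_weights(X, y, GENES)
y_hat = predict(X, w, GENES)
print("weights:", [round(v, 6) for v in w])
print("RMSE:", rmse(y, y_hat), "R2:", r_squared(y, y_hat))
```

Because the targets here are generated from the same genes, the fit recovers the chosen weights almost exactly. With real rate constant data the same two metrics would be computed separately on the training set (internal validation) and on held-out reactions (external validation).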
The Diels-Alder reaction is a well-studied organic reaction between a conjugated diene and an alkene, the latter termed the dienophile. Evans and Johnson (1999) have discussed this reaction in detail. It involves the cycloaddition of the two reactants in the presence of a solvent. Earlier, Rideout and Breslow (1980) discussed the hydrophobic acceleration of the Diels-Alder reaction, specifically the influence of hydrophobic cavities in organic structures on the reaction rate. Both of these works observed an impact of the solvent on the rate constant of the reaction. This solvent dependence makes the Diels-Alder reaction an appealing choice for the present study.
Methodology
From the work of Tang et al. (2012) and Zhou et al. (2014, 2015), we generated a diverse data set of 72 reactions comprising 38 different dienophiles, 19 dienes and 10 solvents. All chemical species were built using the Avogadro software. The structures were optimized using MMFF94s, a force field available for geometry optimization in Avogadro, as suggested by Datta et al. (2015). The optimized geometries were saved as MOL files. These files were then used as input for
Results
Because GPTIPS 2.0 allows the number and depth of genes to be varied, the objective was to develop a nonlinear model that outperforms the one presented by Datta et al. (2017) while containing the minimum possible number of descriptors. Although only connectivity index descriptors were used in this effort, a model with fewer descriptors is easier to interpret. The population size and number of iterations were kept constant at
Conclusions
In this project, the aim was to develop a better model than that previously proposed for the same property using QSPR analysis (Datta et al., 2017). From the work of Datta et al. (2017), it was clear that the hybrid GA-DT method was very efficient in developing linear models of this type. Dev et al. (2017) also concluded that the hybrid GA-DT method provided the best possible model; thus, developing a nonlinear model was the only remaining option for obtaining a better property model. From the results
References (18)
- et al. (2015). Data Mining and Regression Algorithms for the Development of a QSPR Model Relating Solvent Structure and Ibuprofen Crystal Morphology. Computer Aided Chemical Engineering.
- et al. (2017). Comparison of Tree Based Ensemble Machine Learning Methods for Prediction of Rate Constant of Diels-Alder Reaction. Computer Aided Chemical Engineering.
- et al. (2007). Generalized Topological Indices. Modeling Gas-Phase Rate Coefficients of Atmospheric Relevance. J. Chem. Inf. Model.
- et al. (2012). A new multi-gene genetic programming approach to nonlinear system modeling. Part I: materials and structural engineering problems. Neural Computing and Applications.
- et al. (2015). Robust design of optimal solvents for chemical reactions-A combined experimental and computational strategy. Chemical Engineering Science.
- et al. (2003). Ester Hydrolysis Rate Constant Prediction from Quantum Topological Molecular Similarity Descriptors. J. Phys. Chem. A.
- et al. (2017). Hybrid genetic algorithm-decision tree approach for rate constant prediction using structures of reactants and solvents for Diels-Alder reaction. Computers and Chemical Engineering.
- et al. (1999). Diels-Alder Reactions. Chapter 33.1, Comprehensive Asymmetric Catalysis I-III.
- et al. (2011). Multi-stage genetic programming: A new strategy to nonlinear system modeling. Information Sciences.