A genetic programming-based QSPR model for predicting solubility parameters of polymers

doi:10.1016/j.chemolab.2015.04.005

Chemometrics and Intelligent Laboratory Systems

Volume 144, 15 May 2015, Pages 122-127

https://doi.org/10.1016/j.chemolab.2015.04.005

Highlights

• The genetic programming (GP) model accurately predicts the solubility parameters.
• GP reconstructs a transparent relationship for predicting the solubility parameters.
• GP captures nonlinear relationships among the molecular descriptors.

Abstract

In this study, a linear and a nonlinear quantitative structure-property relationship (QSPR) model, called the multiple linear regression based QSPR (MLR-QSPR) model and the genetic programming based QSPR (GP-QSPR) model respectively, were built to predict the solubility parameters of polymers with structure –(C¹H₂–C²R³R⁴)– as a function of some constitutional, topological and quantum chemical descriptors. The internal validation analysis indicated that the GP-QSPR model has better goodness-of-fit statistics. The external and overall validation measures also confirmed that the GP-QSPR model significantly outperforms the MLR-QSPR model on some performance metrics over the same testing data set, and that genetic programming has good potential to yield more accurate models in QSPR studies.

Introduction

Prediction of polymers' solubility parameters is of great importance in many technological and industrial applications of polymers [1], [2], [3], [4], [5]. The solubility parameter is an intrinsic physicochemical parameter defined by the Hildebrand–Scatchard solution theory [6], [7]. Its experimental determination, however, is not easy, and traditional methodologies for predicting polymer solubility (e.g., group contribution methods) do not meet accuracy requirements [8], [9], [10], [11]. A few studies in the literature show that, in such cases, developing predictive quantitative structure–property relationship (QSPR) models with linear or nonlinear data-driven methods (e.g., multiple linear regression, artificial neural networks and fuzzy set theory) appears to be a good alternative that overcomes the limitations of conventional approaches such as the group contribution methods [12], [13], [14]. As an example, Yu et al. [12] introduced a multiple linear regression based QSPR model for the prediction of solubility parameters of amorphous polymers. They concluded that the model correlates the solubility parameters well with six molecular descriptors and that its predictions were better than those of previous models [13], [14]. However, the correlation coefficient between experimental and predicted solubility parameters was limited to 0.840 during the testing stage of the proposed model.

This study examines the applicability of another data-driven method, genetic programming (GP), to the prediction of the solubility parameters of polymers. GP [15] is a purely nonlinear modeling approach that can be described as an extension of the well-known genetic algorithm (GA). The main difference between them is the representation of the solution: GA uses a string of numbers to represent a solution, whereas GP solutions are computer programs. GP creates computer programs to solve a problem using the principle of Darwinian natural selection. It differs from other data-driven models (e.g., artificial neural networks) mainly in that it defines an explicit functional relationship between input and output variables by optimizing the model structure and its coefficients simultaneously. To the authors' knowledge, applications of GP in QSPR studies are very few; they include prediction of the wavelength of the lowest UV transition for a system of 18 anthocyanidins [16] and of the sublimation enthalpy of a wide range of organic contaminants from their 3D molecular structures alone [17].

The present work focuses on further development of QSPR models for accurately predicting the solubility parameters of polymers. The GP based QSPR model was developed using the experimental solubility parameters and molecular descriptors of 97 polymers with structure –(C¹H₂–C²R³R⁴)–, which were previously given in Yu et al. [12]. Its predictive performance was compared with that of a multiple linear regression based QSPR model. This study is the first to investigate the implementation of GP in this field.

Section snippets

Theory and calculation

The process of GP starts with a random initial population of computer programs. Each individual program in the population is represented as a parse tree, built by combining functions (nodes) and terminals (leaves) drawn from a function set and a terminal set appropriate to the problem [15]. A function set may consist of basic arithmetic operators, mathematical functions, conditional operators, Boolean operators, iterative functions and any
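The parse-tree representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function set (`add`, `sub`, `mul`), the terminal set (`x1`, `x2` and the constant 2.0), and the helper names `random_tree`, `evaluate` and `to_infix` are assumptions chosen only to show how functions form the inner nodes and terminals form the leaves.

```python
import operator
import random

# Hypothetical function set: name -> (callable, arity).
FUNCTIONS = {
    "add": (operator.add, 2),
    "sub": (operator.sub, 2),
    "mul": (operator.mul, 2),
}
SYMBOLS = {"add": "+", "sub": "-", "mul": "*"}

# Hypothetical terminal set: two input variables and one constant.
TERMINALS = ["x1", "x2", 2.0]


def random_tree(depth, rng):
    """Grow a random parse tree: inner nodes are functions, leaves are terminals."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    arity = FUNCTIONS[name][1]
    return (name,) + tuple(random_tree(depth - 1, rng) for _ in range(arity))


def evaluate(tree, env):
    """Recursively evaluate a parse tree for the variable values in env."""
    if isinstance(tree, tuple):
        func = FUNCTIONS[tree[0]][0]
        return func(*(evaluate(child, env) for child in tree[1:]))
    if isinstance(tree, str):
        return env[tree]  # variable terminal
    return tree  # numeric constant terminal


def to_infix(tree):
    """Render a parse tree as an explicit, human-readable formula."""
    if isinstance(tree, tuple):
        left, right = (to_infix(child) for child in tree[1:])
        return f"({left} {SYMBOLS[tree[0]]} {right})"
    return str(tree)


if __name__ == "__main__":
    # A hand-built individual representing x1 * x2 + 2.0.
    individual = ("add", ("mul", "x1", "x2"), 2.0)
    print(to_infix(individual), "=", evaluate(individual, {"x1": 3.0, "x2": 4.0}))

    # A random initial population of five individuals, as in the first GP step.
    rng = random.Random(0)
    population = [random_tree(3, rng) for _ in range(5)]
    for tree in population:
        print(to_infix(tree))
```

Because every individual can be printed as a formula, a model evolved this way stays explicit, which is the transparency the highlights refer to. A full GP run would add fitness evaluation, selection, crossover and mutation on top of this representation.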

Results and discussion

The GP based QSPR model was developed to explore explicit relationships between the solubility parameter and the influencing variables (i.e., molecular descriptors calculated directly from the repeating unit structures of polymers). Yu et al. [12] showed that the formulation of the solubility parameter (δ) can be considered to be as follows: $\delta = f(hb, alk, n_{N}, Q_{ii}, E_{int}, Q_{H})$ where $hb = \frac{m Q_{\pm}}{n^{2}}$, m is the number of –OH, –NH or –CN groups in the side groups, $Q_{\pm}$ is the hydrogen bond descriptor, n is the number of atoms
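As a small worked illustration of the hb descriptor defined above, the following Python sketch evaluates hb = m·Q± / n². The function name `hb_descriptor` and the numerical inputs are hypothetical. Since the definition of n is truncated in this excerpt, n is treated simply as a positive atom count.

```python
def hb_descriptor(m, q_pm, n):
    """Return hb = m * Q_pm / n**2.

    m    : number of -OH, -NH or -CN groups in the side groups
    q_pm : hydrogen bond descriptor Q± (illustrative value only)
    n    : atom count (exact definition is truncated in the source; assumed positive)
    """
    if n <= 0:
        raise ValueError("n must be a positive atom count")
    return m * q_pm / n ** 2


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the data set of Yu et al.
    print(hb_descriptor(m=1, q_pm=0.8, n=4))
```

Because n enters squared in the denominator, hb decreases quickly as n grows for a fixed number of hydrogen-bonding groups.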

Conclusions

This study investigates the development of a GP based QSPR model to predict the solubility parameters of polymers. The results show that the predictive performance of the GP based model is better than that of the traditional regression based model, since GP captures complex real-world relationships more effectively than conventional regression methods. Apart from the improvements in the prediction performance gained by using GP, this study demonstrates that GP can

Conflict of interest

The authors declare that there is no conflict of interest.

Acknowledgments

The authors thank Xinliang Yu, Xueye Wang, Hanlu Wang, Xiaobing Li and Jinwei Gao for all data used in this work.

References (32)

A.V. Shevade et al., Molecular modeling of polymer composite-analyte interactions in electronic nose sensors, Sens. Actuators B (2003)
A. Adjei et al., Extended Hildebrand approach: solubility of caffeine in dioxane water mixtures, J. Pharm. Sci. (1980)
K.C. Satyanarayana et al., Polymer property modeling using grid technology for design of structured products, Fluid Phase Equilib. (2007)
E.J. Delgado, Predicting aqueous solubility of chlorinated hydrocarbons from molecular structure, Fluid Phase Equilib. (2002)
B.A. Miller-Chou et al., A review of polymer dissolution, Prog. Polym. Sci. (2003)
B.K. Alsberg et al., A new 3D molecular structure representation using quantum topology with application to structure–property relationships, Chemom. Intell. Lab. Syst. (2000)
M. Bagheri et al., Simple yet accurate prediction method for sublimation enthalpies of organic contaminants using their molecular structure, Thermochim. Acta (2012)
J.R. Koza et al., Routine high-return human-competitive automated problem-solving by means of genetic programming, Inf. Sci. (2008)
S.I. Gass et al., Singular value decomposition in AHP, Eur. J. Oper. Res. (2004)
K. Roy et al., QSPR with extended topochemical atom (ETA) indices: modeling of critical micelle concentration of non-ionic surfactants, Chem. Eng. Sci. (2012)
K. Roy et al., The rₘ² metrics and regression through origin approach: reliable and useful validation tools for predictive QSAR models, Eur. J. Pharm. Sci. (2014)
P.K. Ojha et al., Further exploring rₘ² metrics for validation of QSPR models, Chemom. Intell. Lab. Syst. (2011)
M.L. Koç et al., Prediction of the pH and the temperature-dependent swelling behavior of Ca²⁺-alginate hydrogels by artificial neural networks, Chem. Eng. Sci. (2008)
M.L. Koç et al., Genetic algorithms based logic-driven fuzzy neural networks for stability assessment of rubble-mound breakwaters, Appl. Ocean Res. (2012)
J. Bicerano, Prediction of Polymer Properties (2002)
G. Inzelt, Conducting Polymers: A New Era in Electrochemistry (2008)


¹: Tel.: + 90 346 2191010/1318; fax: + 90 346 2191170.
