A Genetic Programming Approach for Construction of Surrogate Models

https://doi.org/10.1016/B978-0-12-818597-1.50072-2

Abstract

Surrogate models, response surface models or meta-models are black-box models that describe a system with high accuracy. We present a methodology that combines iterative Design of Experiments (DOE) with Genetic Programming (GP) in order to obtain surrogate models. GP is an evolutionary technique to create computer programs. In the context of surrogate modelling, the programs are possible functional forms of the model that are used to fit experimental data. Therefore, unlike most approaches, non-linear combinations of the basis functions are possible. The iterative DOE provides a methodology to choose data points to test current programs and build the next generation. Data is obtained from Aspen Plus based simulations and the process of data acquisition is automated via Python. The methodology is applied to a RadFrac distillation column which is part of a corn to ethanol process and considers three input and three output variables. The results indicate that the proposed methodology is able to provide accurate surrogate models for the variables.

Introduction

Surrogate models, response surface models or meta-models are black-box models that approximate the behavior of a system by fitting input-output data to combinations of simple functions. The idea is to replace computationally expensive rigorous models with simpler yet adequately accurate equations. Surrogate modeling has several applications in Process Systems Engineering. For example, surrogate models can be used for the optimization of operating conditions and process design variables, as well as for control. Concrete examples include the work of Caballero and Grossmann (2008), who replaced rigorous equipment models by their surrogates in modular flowsheet optimization, and that of Henao and Maravelias (2011), who used surrogate models of several unit operations (e.g. CSTRs, mixers, flash vessels) in the superstructure optimization of processes for the production of maleic anhydride from benzene.

There are several methods for constructing surrogate models; the most common ones are kriging (also known as Gaussian process regression), Support Vector Regression (SVR), radial basis functions, and artificial neural networks (ANN). Kriging was proposed by Krige (1951), and it generates a surface from data points and an assumed prior distribution. The surface thus obtained has the form of a weighted sum of independent basis functions whose error is normally distributed with mean zero. In SVR, a surrogate model made of a weighted combination of (usually linear) basis functions is first built. Then, the values of the weights are found by solving an optimization problem that minimizes the norm of the vector of weights and the sum of the deviations of each data point from the model. In ANN, a network that connects inputs and outputs through hundreds of nodes is created. The propagation weights that connect the nodes are learned during a training step, and are afterwards used so that the ANN-based surrogate model returns the output value when input values are provided.
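As an illustration of the SVR optimization problem described above, one common textbook statement is the ε-insensitive formulation below. It is a sketch, not necessarily the exact variant meant by the cited works: the symbols $w$ (weights), $b$ (offset), $\phi$ (basis functions), $C$ (trade-off parameter), $\varepsilon$ (tolerance) and the slack variables $\xi_i, \xi_i^*$ are notation assumed here and do not appear in the original text.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*} \quad & \tfrac{1}{2}\,\lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \\
\text{s.t.} \quad & y_i - w^{\top}\phi(x_i) - b \le \varepsilon + \xi_i, \\
& w^{\top}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \dots, n.
\end{aligned}
```

The first term penalizes the norm of the weight vector, and the second penalizes deviations of the data points from the model that exceed the tolerance $\varepsilon$.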

Recent advances in surrogate modeling include ALAMO and Extreme Learning Machines (ELM). ALAMO was developed by Cozad et al. (2014) and is a software tool for the automated learning of algebraic models for optimization. It identifies surrogate models that are linear combinations of non-linear bases (e.g. polynomial, exponential and logarithmic functions), using experimental or simulation-based data points. ELM is a feed-forward neural network in which the hidden layers of nodes are randomly chosen. Davis et al. (2018) compared several of these methods by analyzing the surrogate models derived for thirty-five benchmark functions from the Virtual Library of Simulation Experiments (Surjanovic and Bingham, 2013). Their results indicate that ANN, ALAMO and ELM provided the best approximations, while ELM converged fastest.
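To make the phrase "linear combinations of non-linear bases" concrete, such a model can be written in the generic form below. The symbols $\beta_j$ (coefficients) and $\phi_j$ (basis functions) are notation assumed for this sketch, and the listed bases are only examples of the kinds named above, not ALAMO's actual basis set.

```latex
\hat{y}(x) = \sum_{j=1}^{m} \beta_j \, \phi_j(x),
\qquad
\phi_j(x) \in \left\{ x_k,\ x_k^2,\ x_k x_l,\ e^{x_k},\ \ln x_k,\ \dots \right\}
```

The model is non-linear in the inputs $x$ but linear in the coefficients $\beta_j$, so the functional form is fixed once the bases are selected.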

Genetic Programming is a technique to create computer programs (Poli et al., 2008). The main difference between the GP approach and other approaches for building surrogate models is that GP freely searches for the best functional combination of the basis functions, while the others assume linear combinations of them. Like Genetic Algorithms (GA), GP is based on evolution: an initial population of solutions is randomly chosen, and subsequent generations are created by fitness-proportionate operations among the individuals of the population. The main difference between GA and GP is that the former evolves numeric solutions, whereas the latter evolves a program. The most common application of GP is symbolic regression. In that case, GP is used to fit a model to a data set, and it relies on expression trees. Figure 1 shows examples of such trees: every end node (leaf) is either a number (coefficient) or an input variable; the interior nodes correspond to operators. Another (deterministic) approach to symbolic regression can be found in Cozad and Sahinidis (2018).
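The expression-tree representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nested-tuple encoding, the operator table `OPERATORS`, and the helper names `evaluate` and `rmse` are all assumptions made for this example.

```python
import math
import operator

# Hypothetical operator set; the operators used in the paper are not listed here.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, inputs):
    """Evaluate an expression tree at one input point.

    A tree is a number (coefficient leaf), a string such as "x0"
    (input-variable leaf), or a tuple (op, left, right) (interior node).
    """
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        return inputs[tree]
    op, left, right = tree
    return OPERATORS[op](evaluate(left, inputs), evaluate(right, inputs))


def rmse(tree, data):
    """Root-mean-square error of a tree over (inputs, target) pairs."""
    squared = [(evaluate(tree, x) - y) ** 2 for x, y in data]
    return math.sqrt(sum(squared) / len(squared))


# Tree encoding the expression 2.0 * x0 + x1
tree = ("+", ("*", 2.0, "x0"), "x1")

# Two made-up data points that this expression fits exactly
data = [
    ({"x0": 1.0, "x1": 0.5}, 2.5),
    ({"x0": 2.0, "x1": 1.0}, 5.0),
]
```

In a symbolic regression setting, an error measure such as `rmse(tree, data)` could serve as the fitness that guides selection among candidate trees.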

In the context of Process Systems Engineering, GP has been applied to derive dynamic models for a binary distillation column (Willis et al., 1997) and an extruder (Hinchliffe and Willis, 2003), process models of wastewater treatment reactors (Dürrenmatt and Gujer, 2012) and heat transfer correlations (Cai et al., 2006). Despite these previous works, GP is not yet regarded as a commonly used tool for the generation of surrogate models. The aim of this work is to present our initial results on the generation of surrogate models of chemical engineering processes using GP.

The paper is structured as follows: we begin by providing a brief introduction to Genetic Programming in which the basic features and most important operations to generate the programs are summarized. Next, we introduce the approach used in this paper and the application case study. Finally, numerical results for different assumptions are presented.

Section snippets

Genetic Programming

Evolutionary algorithms (EAs; Goldberg, 1989) are a family of stochastic search methods inspired by the natural process of evolution of species. EAs iteratively evolve a set of candidate solutions to the optimization problem, known as the population. The search process involves the probabilistic application of evolutionary operators to find better solutions, and it is guided by the survival-of-the-fittest principle. It involves the use of a fitness function that is a metric closely

GP-based generation of a surrogate model

Figure 3 outlines the procedure for generating the surrogate model. It begins by selecting the input variables (x0, x1, …, xN, a subset of the process design variables) and the set of mathematical operators that will be included in the model. Following one of the initialization mechanisms described in the previous section, Np trees with a minimum of NLmin and a maximum of NLmax levels are built.
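The initialization step can be sketched as follows. The original does not say here which initialization mechanism is used, so this example assumes a simple "grow"-style random construction; the operator set, the leaf-sampling rule, the coefficient range, and the function names `random_tree`, `levels` and `init_population` are hypothetical.

```python
import random

OPERATORS = ["+", "-", "*"]      # assumed operator set
VARIABLES = ["x0", "x1", "x2"]   # three input variables, as in the case study


def random_leaf(rng):
    """Return an input variable or a random coefficient (assumed 50/50 split)."""
    if rng.random() < 0.5:
        return rng.choice(VARIABLES)
    return round(rng.uniform(-1.0, 1.0), 3)


def random_tree(rng, min_levels, max_levels, level=1):
    """Build a random tree with between min_levels and max_levels levels."""
    if level >= max_levels:
        return random_leaf(rng)
    # Above the minimum depth an operator node is forced; deeper down,
    # a coin flip decides between an operator node and a leaf.
    if level < min_levels or rng.random() < 0.5:
        return (
            rng.choice(OPERATORS),
            random_tree(rng, min_levels, max_levels, level + 1),
            random_tree(rng, min_levels, max_levels, level + 1),
        )
    return random_leaf(rng)


def levels(tree):
    """Number of levels in a tree (a single leaf counts as one level)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + max(levels(tree[1]), levels(tree[2]))


def init_population(n_trees, min_levels, max_levels, seed=None):
    """Create n_trees random trees within the given level bounds."""
    rng = random.Random(seed)
    return [random_tree(rng, min_levels, max_levels) for _ in range(n_trees)]


population = init_population(n_trees=20, min_levels=2, max_levels=5, seed=0)
```

Here `n_trees`, `min_levels` and `max_levels` play the roles of Np, NLmin and NLmax; the values 20, 2 and 5 are arbitrary and chosen only for the illustration.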

Input-output data for training the programs is obtained from Aspen Plus simulations. The initial data set

Application: surrogate model for a bioethanol distillation column

As an example of the use of GP for building a surrogate model we present the case of a distillation column (RadFrac) in Aspen Plus. The distillation column is part of the corn to bioethanol process simulation included in the Aspen Plus Resources (bioethanol from corn example). A scheme of the process is presented in Figure 4. The design process variables chosen as input variables are the Reflux Ratio (L / D, x0) bounded between 1 and 3; the Boilup Ratio (V / B, x1) bounded between 0.1 and 1,

Results

Figure 5 shows the progression of the algorithm, for the three design variables, against the number of training points considered for development of the model. Progression in these figures is measured as the RMSE of the surrogate model with the best fitness for the number of training points shown on the x-axis, evaluated on an external evaluation data set (21675 points). The number of training points is related to the number of iterations of the alternative algorithm (i.e., 8 points

Summary and final remarks

In this work we presented a methodology for the generation of surrogate models using Genetic Programming. Two alternatives were considered: in one, experiments are performed before initialization; in the other one an iterative Design of Experiments is performed as part of the learning process.

The methodology was applied to derive surrogates for three output variables of an Aspen Plus RadFrac distillation column. The results indicate that both alternatives are able to reach surrogate

References (15)

  • W. Cai et al. Heat transfer correlations by symbolic regression. Int. J. Heat Mass Transf. (2006)
  • M.P. Hinchliffe et al. Dynamic systems modelling using genetic programming. Comput. Chem. Eng. (2003)
  • J.A. Caballero et al. An algorithm for the use of surrogate models in modular flowsheet optimization. AIChE J. (2008)
  • A. Cozad et al. Learning surrogate models for simulation-based optimization. AIChE J. (2014)
  • A. Cozad et al. A global MINLP approach to symbolic regression. Mathematical Programming (2018)
  • S.E. Davis et al. Efficient Surrogate Model Development: Impact of Sample Size and Underlying Model Dimensions
  • D.J. Dürrenmatt et al. Automatic reactor model synthesis with genetic programming. Water Sci. Technol. (2012)
There are more references available in the full text version of this article.
