1 Introduction

A common remedy to expensive optimization problems is to replace exact evaluations with approximations learned from past evaluations. In principle, any supervised learning technique can be used to build surrogate models, for instance, Gaussian processes or Kriging [16], random forests [7], support vector machines [4], and genetic programming based (GP-based) symbolic regression [9], to name a few. Among these techniques, Kriging/Gaussian processes are appealing both in academic studies and in applications because they provide both a prediction and the corresponding uncertainty estimate as outputs. These two outputs allow an acquisition function (a.k.a. infill criterion in some papers) to balance exploitation and exploration during the optimization process.

Compared with the black-box modeling technique of Kriging, transparent-box modeling techniques have recently become a hot topic due to their ability to interpret the relationship between inputs and predictions. Among the many transparent-box modeling techniques, GP-based symbolic regression (SR) attracts much attention, as it searches the space of all possible mathematical formulas for concise or closed-form expressions that best fit a dataset [1]. To quantify prediction uncertainty in GP-based symbolic regression, a simple technique is bootstrapping, which is widely used in the machine learning field. Bootstrapping is usually used in ensemble methods to reduce the variance of a (simple) predictor by training several predictors on different samples of a dataset and averaging their predictions. The set of predictions from the individual predictors can be used to estimate properties of the predictor (such as the prediction mean and its variance) by measuring those properties when sampling from an approximating distribution. For example, this technique is incorporated into a random forest (RF) in Sequential Model-Based Algorithm Configuration (SMAC) [12], where the variance is calculated from the predictions of the trees of the random forest. Following the same idea, the prediction variance of GP-based symbolic regression can also be estimated by bootstrapping; similar work can be found in [2, 8, 10]. Another technique, which quantifies the uncertainty of the prediction's residual, was proposed in [19]: the so-called GP-based symbolic regression with Kriging (GP2) sums a GP-based symbolic expression and an additional residual term that regulates the predictions of the symbolic expression, where the residual term follows a normal distribution and is estimated by Kriging.

The techniques mentioned above allow GP-based symbolic regression to employ acquisition functions that balance exploration and exploitation during the optimization process. Under the framework of the surrogate-assisted multi-objective optimization algorithm (SMOA), this paper utilizes the recently proposed modeling technique GP2 [19] and GP-based symbolic regression with a bootstrapping technique as the surrogate models to solve multi-objective optimization problems (MOPs). These modeling techniques are compared with two state-of-the-art modeling techniques, Kriging and random forest.

The main contribution of this paper is extending the set of surrogate models that allow an acquisition function to balance exploration and exploitation in the field of SMOA. This paper is structured as follows: Sect. 2 describes the preliminaries of multi-objective optimization problems and introduces the framework of the surrogate-assisted multi-objective optimization algorithm; Sect. 3 explains three different techniques for surrogate models; Sect. 4 introduces the definition of the upper confidence bound (UCB) [6, 17] and its related concepts; Sect. 5 presents the parameter settings and discusses the experimental results.

2 Surrogate-Assisted Multi-objective Optimization

This section introduces the preliminaries of the SMOA, including the definition of MOPs in Sect. 2.1 and the framework of SMOA in Sect. 2.2.

2.1 Multi-objective Optimization Problem

A continuous MOP is defined as minimizing multiple objective functions simultaneously and can be formulated as:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\textbf{x}} \quad \textbf{f}(\textbf{x}):= \big ( {f}_1(\textbf{x}),\cdots ,{f}_m(\textbf{x}) \big ), \qquad \qquad \textbf{x} \in \mathcal {X} \subseteq \mathbb {R}^d \end{aligned}$$
(1)

where m is the number of objective functions, \({f}_i\) denotes the i-th objective function \(f_i: \mathcal {X} \rightarrow \mathbb {R}\), \(i = 1, \dots , m\), and \(\mathcal {X}\) is the decision space.

2.2 Framework of Surrogate-Assisted Multi-objective Optimization

The fundamental concept of SMOA is to first build a surrogate model \(\mathcal {M}_i\) for each objective, reflecting the relationship between a decision vector \(\textbf{x} = (x_1, \cdots , x_d)\) and the corresponding objective value \(y_i = f_i(\textbf{x})\), \(i\in \{1,\cdots ,m\}\). In SMOA, it is usually assumed that the objective functions are mutually independent in the objective space.

SMOA starts by sampling an initial design of experiment (DoE) of size \(\eta \) (line 2 in Algorithm 1), \(\textrm{X}=\{\textbf{x}^{(1)},\textbf{x}^{(2)}, \ldots , \textbf{x}^{(\eta )}\} \subseteq \mathcal {X}\). The initial DoE \(\textrm{X}\) and its corresponding objective values \(\textbf{Y}=\{\textbf{f}(\textbf{x}^{(1)}),\textbf{f}(\textbf{x}^{(2)}),\cdots ,\textbf{f}(\textbf{x}^{(\eta )})\} \subseteq \mathbb {R}^{m}\) (line 3 in Algorithm 1) are then used to construct the surrogate models \(\mathcal {M}_i\). In each iteration, SMOA searches for a decision vector \( \textbf{x} '\) in the search space \(\mathcal {X}\) by maximizing the acquisition function \(\mathscr {A}\), parameterized by \(\gamma \) and the surrogate models (line 7 in Algorithm 1). Here, an acquisition function quantifies how good or bad a point \( \textbf{x} \) is in the objective space. In this paper, \(\mathscr {A}\) is the upper confidence bound (described in Sect. 4) due to its ability to balance exploration and exploitation, its low computational complexity, and its popularity in the deep learning field.

Algorithm 1. Framework of SMOA (pseudocode not shown).

A single-objective optimization algorithm is deployed to search for the decision vector \( \textbf{x} ^*\) that maximizes the acquisition function. In this paper, we use BI-population CMA-ES to search for the optimal \( \textbf{x} ^*\) due to its favorable performance on the BBOB function testbed [11]. The optimal decision vector \( \textbf{x} ^*\) is then evaluated by the 'true' objective functions \( \textbf{f} \), and the surrogate models are retrained using the updated \(\textrm{X}\) and \(\textbf{Y}\). The main loop (lines 6 to 12 in Algorithm 1) continues until a stopping criterion is satisfied. Common stopping criteria include a maximum number of iterations, convergence speed, etc. In this paper, we specify the stopping criterion (\(T_c\)) as a budget of function evaluations.
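The following minimal Python sketch illustrates this loop; it is our own illustration rather than the authors' exact Algorithm 1, and the DoE sampler, surrogate trainer, acquisition function, and optimizer (e.g., BI-population CMA-ES) are passed in as callables rather than implemented here.

import numpy as np

def smoa(f, sample_doe, fit_surrogate, acquisition, maximize, m, eta=30, budget=200):
    """Sketch of the SMOA loop: DoE -> fit surrogates -> maximize acquisition -> evaluate -> update."""
    X = sample_doe(eta)                                      # line 2: initial design of experiment
    Y = np.array([f(x) for x in X])                          # line 3: 'true' objective values
    evals = eta
    while evals < budget:                                    # stopping criterion T_c
        models = [fit_surrogate(X, Y[:, i]) for i in range(m)]   # one model per objective
        x_new = maximize(lambda x: acquisition(x, models, Y))    # line 7: maximize A(x; gamma, M)
        y_new = f(x_new)                                     # evaluate the expensive objectives
        X, Y = np.vstack([X, x_new]), np.vstack([Y, y_new])  # update the dataset
        evals += 1
    return X, Y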

3 Surrogate Models

This section introduces a common surrogate model, Kriging, in Sect. 3.1; a canonical GP-based symbolic regression in Sect. 3.2; and the GP-based symbolic regression with Kriging (GP2) in Sect. 3.3. The method for quantifying prediction uncertainty is introduced in each subsection.

3.1 Kriging

Kriging is a statistical interpolation method and has proven to be a popular surrogate model for approximating noise-free data in computer experiments. Considering a realization of f at n locations, expressed as the vector \(\textbf{f}(\textbf{X}) =( {f}(\textbf{x}^{(1)}), \cdots , {f}(\textbf{x}^{(n)}) )\) with \(\textbf{X}=\{\textbf{x}^{(1)}, \cdots , \textbf{x}^{(n)}\} \subseteq \mathcal {X}\), Kriging assumes \(\textbf{f}(\textbf{X})\) to be a realization of a random process f of the following form [5, 13]:

$$\begin{aligned} f(\textbf{x}) = \mu (\textbf{x}) + \epsilon (\textbf{x}), \end{aligned}$$
(2)

where \(\mu (\textbf{x})\) is the estimated mean value and \(\epsilon (\textbf{x})\) is a realization of a Gaussian process with zero mean and variance \(\sigma ^2\).

The regression part \(\mu (\textbf{x})\) approximates the function f(.) globally. The Gaussian process \(\epsilon (\textbf{x})\) takes local variations into account and provides the uncertainty quantification. The correlation between the deviations at two decision vectors (\(\textbf{x}\) and \(\mathbf {x'}\)) is defined as:

$$ Corr[\epsilon (\textbf{x}),\epsilon (\mathbf {x'})] = R(\textbf{x},\mathbf {x'}) = \prod _{i=1}^d R_i(x_i,x_i'), $$

where R(., .) is the correlation function that decreases with the distance between two points (\( \textbf{x} \text { and } \textbf{x} '\)).

It is common practice to use a Gaussian correlation function (a.k.a., a squared exponential kernel):

$$ R(\textbf{x},\mathbf {x'}) = \prod ^d_{i=1} \text {exp}(-\theta _i(x_i - x'_i)^2), $$

where \(\theta _i \ge 0\) are parameters of the correlation model that reflect the variables' importance [21]. The covariance matrix can then be expressed through the correlation function: \(Cov({\epsilon (\textbf{x})}) = \sigma ^2 \mathbf {\Sigma }\), where \( \mathbf {\Sigma }_{i,j}=R(\textbf{x}^{(i)},\textbf{x}^{(j)})\).

When \(\mu (\textbf{x})\) is assumed to be an unknown constant, the unbiased predictor is called ordinary Kriging (OK). In OK, the Kriging model determines the hyperparameters \(\mathbf {\theta } = (\theta _1, \theta _2, \cdots , \theta _d)\) by maximizing the likelihood over the observed dataset. The (concentrated) log-likelihood function is: \(L = -\frac{n}{2}\ln ({\sigma }^2) - \frac{1}{2}\ln (|\mathbf {\Sigma }|) \).

Uncertainty Quantification: The maximum likelihood estimates of the mean \(\hat{\mu }\) and the variance \(\hat{\sigma }^2\) are: \( \hat{\mu } = \frac{\textbf{1}^{\top }_n \mathbf {\Sigma }^{-1} \textbf{y}}{\textbf{1}^{\top }_n \mathbf {\Sigma }^{-1} \textbf{1}_n} \) and \( \hat{\sigma }^2 = \frac{1}{n}(\textbf{y} - \textbf{1}_n\hat{\mu })^{\top }\mathbf {\Sigma }^{-1}(\textbf{y}-\textbf{1}_n\hat{\mu }) \). The mean and variance of the OK prediction at a target point \(\textbf{x}^t\) can then be derived as follows [13, 20]:

$$\begin{aligned} \mu _{ok}(\textbf{x}^t) =&\hat{\mu } + \textbf{c}^{\top } \mathbf {\Sigma }^{-1} (\textbf{y} - \hat{\mu }\textbf{1}_n), \end{aligned}$$
(3)
$$\begin{aligned} \sigma _{ok}^2(\textbf{x}^t) =\ {}&\hat{\sigma }^2[1 - \textbf{c}^{\top } \mathbf {\Sigma }^{-1}\textbf{c} + \frac{(1-\textbf{1}_n^{\top }\mathbf {\Sigma }^{-1}\textbf{c})^2}{\textbf{1}_n^{\top }\mathbf {\Sigma }^{-1}\textbf{1}_n}], \end{aligned}$$
(4)

where \(\textbf{c} = \big (Corr[\epsilon (\textbf{x}^t),\epsilon (\textbf{x}^{(1)})], \cdots , Corr[\epsilon (\textbf{x}^t),\epsilon (\textbf{x}^{(n)})]\big )^{\top }\).
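A minimal NumPy sketch of the OK predictor for fixed hyperparameters \(\theta \) is given below; the likelihood maximization over \(\theta \) is omitted, and the small nugget term is our own addition for numerical stability.

import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation R(x, x') = prod_i exp(-theta_i * (x_i - x'_i)^2)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2            # pairwise squared differences
    return np.exp(-(d2 * theta).sum(axis=-1))

def ok_predict(X, y, theta, x_t, nugget=1e-10):
    """Ordinary-Kriging mean and variance (Eqs. 3-4) at x_t for given theta."""
    n = len(X)
    Sigma = gauss_corr(X, X, theta) + nugget * np.eye(n)    # correlation matrix
    Si = np.linalg.inv(Sigma)
    ones = np.ones(n)
    mu_hat = ones @ Si @ y / (ones @ Si @ ones)             # MLE of the mean
    r = y - mu_hat * ones
    sigma2_hat = r @ Si @ r / n                             # MLE of the process variance
    c = gauss_corr(np.atleast_2d(x_t), X, theta).ravel()    # correlations to training points
    mu = mu_hat + c @ Si @ r                                # Eq. (3)
    var = sigma2_hat * (1.0 - c @ Si @ c
                        + (1.0 - ones @ Si @ c) ** 2 / (ones @ Si @ ones))  # Eq. (4)
    return mu, max(var, 0.0)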

3.2 Genetic Programming Based Symbolic Regression

Genetic programming (GP) [15] is a typical evolutionary algorithm for solving optimization problems that can be formulated as \(\mathop {\mathrm {arg\,min}}\limits _{\textbf{x}} f(\textbf{x}), \textbf{x} \in \mathcal {X}\), where \(\textbf{x} = (x_1, \cdots , x_n)\) represents a decision vector (also known as an individual) in evolutionary algorithms (EAs). Similar to other EAs, GP evolves a population of solution candidates Pop(.) by following the principle of survival of the fittest and utilizing biologically-inspired operators. Algorithm 2 shows the pseudocode of GP, where the Variation operator includes crossover/recombination and mutation operators.

Algorithm 2. Pseudocode of genetic programming (not shown).
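As a rough illustration of the evolutionary loop in Algorithm 2 (a generic sketch of our own, not the authors' exact pseudocode), the representation-specific parts, i.e., individual initialization, fitness evaluation, and variation, are passed in as callables:

import random

def gp_loop(init_individual, fitness, variation, n_pop=100, n_gen=50, seed=0):
    """Generic GP/EA loop: initialize -> evaluate -> (select, vary, survive) until termination."""
    rng = random.Random(seed)
    pop = [init_individual(rng) for _ in range(n_pop)]       # Pop(0)
    fit = [fitness(ind) for ind in pop]                      # evaluate (minimization)
    for _ in range(n_gen):
        # tournament selection of parents ("survival of the fittest")
        parents = [min(rng.sample(list(zip(pop, fit)), 3), key=lambda p: p[1])[0]
                   for _ in range(n_pop)]
        # Variation: crossover/recombination and mutation, both inside `variation`
        offspring = [variation(rng.choice(parents), rng.choice(parents), rng)
                     for _ in range(n_pop)]
        off_fit = [fitness(ind) for ind in offspring]
        # (mu + lambda) survivor selection keeps the n_pop best individuals
        survivors = sorted(zip(pop + offspring, fit + off_fit), key=lambda p: p[1])[:n_pop]
        pop, fit = [p[0] for p in survivors], [p[1] for p in survivors]
    return min(zip(pop, fit), key=lambda p: p[1])            # best individual and its fitness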

The feature that distinguishes GP from other evolutionary algorithms is the variable-length representation of \(\textbf{x}\). The search space of GP-based symbolic regression problems is the union of a function space \(\mathcal {F}\) and a terminal space \(\mathcal {T} = \{\mathcal {S} \cup \mathbb {R} \} \), where \(\mathcal {S}\) represents the space of variable symbols (e.g., \(x_1\)) and \(c \in \mathbb {R}\) represents a constant. The function set \(\mathcal {F}\) includes the available symbolic functions (e.g., \(\{+, -, \times , \div , \exp , \log \}\)).

To distinguish them from the objective functions and decision variables of MOPs, \(f_{sr}(\cdot )\) and \(\textbf{t}\) are used to denote the objective function of GP-based symbolic regression and a GP individual, respectively. The objective function \(f_{sr}(\cdot )\) can be any performance metric for supervised learning problems, for instance, the mean squared error (MSE), the Pearson correlation coefficient (PCC), or the coefficient of determination (R2). In this paper, MSE, PCC, and R2 are used as the objective functions in GP-based symbolic regression, and they are defined as:

\(\text{ MSE } = \frac{1}{n}\sum _{i=1}^{n}(y_i - \hat{y}_i)^2\), \(\text{ PCC } = \frac{ \sum _{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}}) }{\sqrt{\sum _{i=1}^{n}(y_i-\bar{y})^2}\sqrt{\sum _{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^2}}\), \(\text{ R2 } = 1 - \frac{\sum _{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum _{i=1}^{n}(y_i - \bar{y})^2}\),

where n is the number of samples, \(y_i\) and \(\hat{y_i}\) represent the real response and predicted response of the i-th sample, \(\bar{y} = \frac{1}{n}\sum _{i=1}^n y_i\), and \(\bar{\hat{y}} = \frac{1}{n}\sum _{i=1}^n \hat{y}_i\).

Therefore, a GP-based symbolic regression problem in this paper can be formulated as:

$$\begin{aligned}&\mathop {\mathrm {arg\,max}}\limits _{\textbf{t}} f_{sr}(\textbf{t}), \quad t_i \in \{\mathcal {F}\cup \mathcal {T} \}|_{i=1,2,\cdots } \qquad \text{ if } \quad f_{sr} \in \{\text{ PCC }, \text{ R2 }\} \text{ or } \\&\mathop {\mathrm {arg\,min}}\limits _{\textbf{t}} f_{sr}(\textbf{t}), \quad t_i \in \{\mathcal {F}\cup \mathcal {T} \}|_{i=1,2,\cdots } \qquad \text{ if } \quad f_{sr} = \text{ MSE }. \end{aligned}$$
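For reference, the three metrics above translate directly into a few lines of NumPy (a small sketch of our own):

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def pcc(y, y_hat):
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return (yc * pc).sum() / (np.sqrt((yc ** 2).sum()) * np.sqrt((pc ** 2).sum()))

def r2(y, y_hat):
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()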

Uncertainty Quantification: As in other bootstrapping techniques, each individual in the population of GP-based symbolic regression is trained on a bootstrap sample of the training data, and the prediction variance is estimated from the predictions on these bootstrap samples. In this paper, the detailed procedure of bootstrapping for GP-based symbolic regression is as follows (a code sketch is given after the list):

1. Resample \(n_b \le n\) samples (to form a new dataset \(D'\)) from the original training dataset D of n samples;
2. Train the GP-based symbolic regression model (\( \textbf{t} \)) on the dataset \(D'\);
3. Repeat (1) to (2) \(n_{pop}\) times;
4. The bagging prediction and variance at a query vector \( \textbf{x} ^t\) are then computed as the mean and variance over all \(n_{pop}\) predictions:

$$\begin{aligned}&\mu _{sr}( \textbf{x} ^t) = \frac{1}{n_{pop}}\sum _{i=1}^{n_{pop}} \textbf{t} _i( \textbf{x} ^t) \end{aligned}$$
(5)
$$\begin{aligned}&\sigma _{sr}^2( \textbf{x} ^t) = \frac{1}{n_{pop}} \sum _{i=1}^{n_{pop}} \big ( \textbf{t} _i( \textbf{x} ^t) - \mu _{sr}( \textbf{x} ^t ) \big ) ^2 \end{aligned}$$
(6)
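A minimal sketch of this procedure, using gplearn's SymbolicRegressor [18] as the symbolic-regression engine; the specific hyperparameter values are illustrative assumptions, not the paper's settings.

import numpy as np
from gplearn.genetic import SymbolicRegressor

def bootstrap_sr(X, y, n_pop=20, n_b=None, seed=0):
    """Fit n_pop symbolic regressors, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_b = n if n_b is None else n_b                          # step 1: bootstrap sample size
    models = []
    for i in range(n_pop):                                   # step 3: repeat n_pop times
        idx = rng.integers(0, n, size=n_b)                   # resample with replacement
        sr = SymbolicRegressor(population_size=200, generations=20,
                               function_set=('add', 'sub', 'mul', 'div', 'sin', 'cos'),
                               metric='mse', random_state=i)
        sr.fit(X[idx], y[idx])                               # step 2: train on D'
        models.append(sr)
    return models

def bootstrap_predict(models, X_t):
    """Step 4: bagging prediction (Eq. 5) and variance (Eq. 6) at the query points X_t."""
    preds = np.stack([m.predict(X_t) for m in models])       # shape (n_pop, n_query)
    return preds.mean(axis=0), preds.var(axis=0)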

3.3 GP-Based Symbolic Regression with Kriging

GP-based symbolic regression with Kriging (GP2) integrates GP-based symbolic regression and Kriging by summing a best-fitted expression \(symreg( \textbf{x} )\), obtained on a training dataset, and a residual term. The prediction of GP2 can be expressed as:

$$\begin{aligned} \text {GP}^2_{symreg}(\textbf{x}) = symreg(\textbf{x}) + res(\textbf{x}), \end{aligned}$$
(7)

where \(res(\textbf{x})\) is the residual-regulating term that follows a normal distribution and is obtained by Kriging.

The pseudocode of GP2 is shown in Algorithm 3. The algorithm starts by generating the training dataset D, i.e., computing the real responses \(\textbf{Y}\) of the decision set \(\textbf{X}\). The best-fitted symbolic expression \(symreg( \textbf{x} )\) on dataset D is generated by GP-based symbolic regression (line 4 in Algorithm 3). The residuals of the best-fitted symbolic expression \(symreg( \textbf{x} )\) on \(\textbf{X}\) are computed as \(\textbf{Y} - \hat{\textbf{Y}}\). \(\textbf{X}_{gp}\) (line 6 in Algorithm 3) represents the variables that affect the residuals of \(symreg( \textbf{x} )\). The most intuitive choice is to set \(\textbf{X}_{gp}\) to \(\textbf{X}\). It is also reasonable to set \(\textbf{X}_{gp}\) to \(\hat{\textbf{Y}}\) or \(\{ \textbf{X} \cup \hat{\textbf{Y}}\}\), since the residual of \(symreg( \textbf{x} )\) highly depends on the prediction \(\hat{\textbf{Y}}\). Kriging then models the relationship between \(\textbf{X}_{gp}\) and the residuals of \(symreg( \textbf{x} )\) in the dataset D (line 7 in Algorithm 3). The residual of \(symreg( \textbf{x} )\), denoted as \(res( \textbf{x} )\), also depends on \(\textbf{X}\), as \(\textbf{X}_{gp}\) depends on \(\textbf{X}\). The final surrogate model built by GP2, denoted as \(\text {GP}^2_{symreg}( \textbf{x} )\), is the sum of \(symreg( \textbf{x} )\) and its corresponding residual distribution \(res( \textbf{x} )\).

Algorithm 3. Pseudocode of GP2 (not shown).

Uncertainty Quantification: The prediction of GP2 is the sum of the symbolic expression and a residual term that follows a normal distribution. The prediction and variance at a query vector \( \textbf{x} ^t\) under the GP2 strategy are:

$$\begin{aligned} \mu _{GP2}( \textbf{x} ^t)&= symreg( \textbf{x} ^t) + \mu _{ok}( \textbf{x} ^t), \end{aligned}$$
(8)
$$\begin{aligned} \sigma ^2_{GP2}( \textbf{x} ^t)&= s_{sr}^2( \textbf{x} ^t) + \sigma ^2_{ok}( \textbf{x} ^t), \end{aligned}$$
(9)

where \(s_{sr}^2( \textbf{x} ^t)\) is the variance of the symbolic regression expression \(symreg( \textbf{x} )\) at \( \textbf{x} ^t\).

In this paper, we argue that \(s_{sr}^2( \textbf{x} ) = 0\), since \(symreg( \textbf{x} )\) is an explicit mathematical expression and therefore has no variance at an arbitrary point \( \textbf{x} \). Therefore, Eq. (9) simplifies to \(\sigma ^2_{GP2}( \textbf{x} ^t) = \sigma ^2_{ok}( \textbf{x} ^t)\).
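A minimal sketch of building and querying GP2 under these assumptions (Eqs. 7-9 with \(s_{sr}^2=0\) and \(\textbf{X}_{gp}=\textbf{X}\), i.e., the Alg. IV configuration); we use gplearn for \(symreg\) and scikit-learn's GaussianProcessRegressor as a stand-in for the Kriging model of the residuals, with illustrative hyperparameters.

import numpy as np
from gplearn.genetic import SymbolicRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def fit_gp2(X, y):
    """GP^2: best-fitted symbolic expression plus a Kriging model of its residuals."""
    symreg = SymbolicRegressor(population_size=500, generations=30,
                               metric='mse', random_state=0)
    symreg.fit(X, y)                                   # Algorithm 3, line 4: best-fitted expression
    residuals = y - symreg.predict(X)                  # Y - Y_hat
    kriging = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    kriging.fit(X, residuals)                          # Algorithm 3, line 7: res(x) with X_gp = X
    return symreg, kriging

def gp2_predict(symreg, kriging, X_t):
    """Prediction mean (Eq. 8) and variance (Eq. 9 with s_sr^2 = 0) at query points X_t."""
    res_mu, res_std = kriging.predict(X_t, return_std=True)
    return symreg.predict(X_t) + res_mu, res_std ** 2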

4 Acquisition Functions

The Hypervolume Indicator (HV), introduced in [22], is a widely used unary indicator for evaluating the quality of a Pareto-front approximation set in MOPs, due to its Pareto-compliance property and the fact that it requires no prior knowledge of the Pareto front. Maximizing the HV leads to a high-quality and diverse Pareto-front approximation set. The Hypervolume Indicator is defined as follows:

$$\begin{aligned} {\text {HV}} (\mathcal{P}\mathcal{F},\textbf{r}) :=\lambda _m(\cup _{ \textbf{y} \in \mathcal{P}\mathcal{F}}[ \textbf{y} , \textbf{r}]), \end{aligned}$$
(10)

where \(\lambda _m\) is the Lebesgue measure and the reference point \( \textbf{r} \) clips the space to ensure a finite volume.

The Hypervolume improvement (HVI) measures the improvement of a new solution in terms of HV and is defined as:

$$\begin{aligned} \text{ HVI } (\textbf{y}; \mathcal{P}\mathcal{F}, \textbf{r}) := \text{ HV }(\mathcal{P}\mathcal{F}\cup \{\textbf{y}\}, \textbf{r}) - \text{ HV }(\mathcal{P}\mathcal{F}, \textbf{r}) \end{aligned}$$

The upper confidence bound of the HVI (UCB) [6, 17] combines the HVI with the prediction uncertainty in a straightforward way. Denoting \( \boldsymbol{\mu } ( \textbf{x} )=(\mu _1( \textbf{x} ), \ldots , \mu _m( \textbf{x} ))\) as the prediction and \({ \boldsymbol{\sigma } }( \textbf{x} )=({\sigma }_1( \textbf{x} ),\ldots , {\sigma }_m( \textbf{x} ))\) as the uncertainty of the m independent surrogate models, the UCB of a solution at \( \textbf{x} \) is the hypervolume improvement of the optimistic prediction \({ \boldsymbol{\mu } }( \textbf{x} ) - \omega { \boldsymbol{\sigma } }( \textbf{x} )\), where \(\omega \in \mathbb {R}_{\ge 0}\) specifies the confidence level:

$$\begin{aligned} \text{ UCB }( \textbf{x} ; \mathcal{P}\mathcal{F}, \textbf{r} , \omega ) := \text{ HVI }({ \boldsymbol{\mu } }( \textbf{x} ) - \omega { \boldsymbol{\sigma } }( \textbf{x} ); \mathcal{P}\mathcal{F}, \textbf{r} ). \end{aligned}$$
(11)

The parameter \(\omega \) controls the balance between exploration and exploitation, and a positive \(\omega \) rewards high prediction uncertainty.

Example 1

Two examples of the landscapes of HVI and UCB are shown in Fig. 1. The Pareto-front approximation set is \(\mathcal{P}\mathcal{F}= \{{\textbf{y}^{(1)}} = (1,3), \textbf{y}^{(2)}=(2, 2), \textbf{y}^{(3)} = (3, 1)\}\), the reference point is \(\textbf{r}=(4,4)\), the same uncertainty (\(\boldsymbol{\sigma } = (1,1)\)) is used for every point in the objective space, and \(\omega \) is set to 1 for the UCB. The left and right plots show the landscapes of HVI and UCB, respectively, over the objective space ranging from −1 to 5 in both \(f_1\) and \(f_2\).
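For the bi-objective case, the HVI and UCB can be computed with a few lines of NumPy; the following sketch (the helper names are ours) reproduces the setting of Example 1.

import numpy as np

def hv_2d(points, ref):
    """Hypervolume (area) dominated by a set of 2-D minimization points w.r.t. ref (Eq. 10)."""
    pts = np.array([p for p in points if np.all(p < ref)])
    if pts.size == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]            # sweep over f1; count non-dominated steps only
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

def hvi(y, pf, ref):
    """Hypervolume improvement of a new objective vector y over the front pf."""
    return hv_2d(np.vstack([pf, y]), ref) - hv_2d(pf, ref)

def ucb(mu, sigma, pf, ref, omega=1.0):
    """UCB (Eq. 11): HVI of the optimistic prediction mu - omega * sigma."""
    return hvi(np.asarray(mu) - omega * np.asarray(sigma), pf, ref)

pf = np.array([[1, 3], [2, 2], [3, 1]])          # Pareto-front approximation from Example 1
r = np.array([4, 4])                             # reference point
print(hvi([1.5, 1.5], pf, r), ucb([1.5, 1.5], [1, 1], pf, r, omega=1.0))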

Fig. 1. Landscape of HVI and UCB.

5 Experiments

In this paper, the algorithms are tested on the ZDT problem suite [23] with the parameters \(d=5\), \(m=2\), and a fixed reference point \(\textbf{r}=[15,15]\). Four categories of surrogate models are compared: Kriging (Alg. I), five GP2 variants with different configurations (Alg. II – VI), random forest (Alg. VII), and GP-based symbolic regression (Alg. VIII and IX). Each experiment consists of 15 independent runs.

5.1 Parameter Settings

The common parameters in all algorithms are the DoE size \(\eta =30\) and the function evaluation budget \(T_c=200\). The stopping criterion on \(f_{sr}\) is 0.1, 0.99, and 0.99 for the MSE, PCC, and R2 metrics, respectively. The function evaluation budget of genetic programming in GP-based symbolic regression and GP2 is 4E3 [19] and 2E4 [18], respectively.

The parameter \(\omega \) of the UCB is set to \(\sqrt{g /\log (g)}\) [14], where g is the number of function evaluations. The function space \(\mathcal {F}\) for symbolic regression is \(\{+, -, \times , \div , \sin , \cos , \text{ pow } \}\). The acquisition function is the UCB and is optimized by CMA-ES in all algorithms. The hyperparameters of CMA-ES are: 3 restarts, an initial population size of 7, and a maximum of 2000 generations; the remaining parameters are set to their default values.

Table 1 shows the other parameter settings of each algorithm, where \(n_{pop}\) represents the number of estimators in the random forest. The other parameters (e.g., \(p_m\), \(p_c\), etc.) are set to the default values of GPLearn [18].

Table 1. Algorithm Parameter Configuration

5.2 Empirical Results

In this section, we evaluate the Pareto-front approximation sets using the HV indicator. Figure 2 shows the HV convergence curves of 15 independent runs on the ZDT problems. To make small differences in HV values visible, we discuss the average \(\log (\varDelta \text{HV})\) convergence, where \(\varDelta \text{HV}\) is measured relative to the reference HV value of 250. Note that the scale varies between problems, since the problems have Pareto fronts with different ranges.

Fig. 2. Average \(\log (\varDelta \)HV) convergence on ZDT1 – ZDT6. The shaded region represents the variance.

From Fig. 2, it is clear that Alg. VII, which uses a random forest as the surrogate model, yields the worst results w.r.t. mean HV and the corresponding standard deviation. The poor performance of Alg. VII compared with Alg. I and the other Kriging-based algorithms (all except Alg. VIII and IX) is expected, since Kriging predictions are more accurate than those of a random forest when the training dataset contains few samples. Alg. VII also fails to outperform the GP-based symbolic regression algorithms (Alg. VIII and IX), which use 'plain' GP, owing to the efficient search mechanism of genetic programming.

On the unimodal convex Pareto-front problem ZDT1, Fig. 2 shows that the GP-based symbolic regression algorithms (Alg. VIII and IX) are only slightly better than the random forest in Alg. VII, but worse than the other, Kriging-based algorithms. The performances of Alg. I – VI are similar w.r.t. the mean HV value over 15 runs. In addition, Alg. I, which uses only Kriging as the surrogate model, converges the fastest among all test algorithms at the early stage of optimization. Similarly, Alg. I also converges much faster than most of the test algorithms on the other two unimodal, concave Pareto-front problems (ZDT2 and ZDT6). An explanation is the relatively simple landscape of these three test problems.

On ZDT3, which has a discontinuous Pareto front, and on the multi-modal problem ZDT4, the algorithms' performances differ greatly. Firstly, the convergence of Alg. I is much slower than that of the other algorithms at the beginning of the optimization. Since a relatively large reference point is used in this paper, the HVI introduced by extreme solutions is much larger than that of a knee point. Thus, we can conclude that the strategy of Kriging with UCB has difficulty finding extreme solutions at the early stage of optimization, which slows down the convergence of Alg. I. Another intuitive explanation is that the landscape of the UCB based on the Kriging models on ZDT3 and ZDT4 is not smooth enough to allow the optimizer CMA-ES to escape local optima. In other words, because of the landscape's roughness, the UCB based on Kriging's predictions can hardly distinguish slight differences around the local optima.

Table 2. Ranking of the algorithms.

All algorithms’ ranking (w.r.t. mean HV value over 15 runs) on each problem is shown in Table 2. From this table, it is easy to conclude that Alg. V, which utilizes GP2 method of \( \textbf{X} _{gp}=\{ \textbf{X} \cup \hat{ \textbf{Y} } \} \) and \(f_{sr}=\text{ R2 }\), outperforms the other algorithms, and Alg. IV (GP2 method of \( \textbf{X} _{gp}=\{ \textbf{X} \} \) and \(f_{sr}=\text{ R2 }\)) is the second best algorithm. Since the search space of \(\textbf{x}\) of surrogate models in Alg. IV is the same search space of the objective function \( \textbf{f} ( \textbf{x} )\) of MOPs, Alg. IV is more reasonable to interpret the residual regulation term \(res( \textbf{x} )\) (in Eq. (7)) when it is compared with Alg. V. Additionally, there is no significant difference between the performance of Alg. IV and Alg. V on the test problems, see Table 3 for details. Considering these two aspects, it is recommended to use Alg. IV when the acquisition function is the UCB.

Table 3. Pairwise Wilcoxon rank-sum test (\(+/\approx /-\)) matrix at the 0.05 significance level among all test algorithms, where the algorithms in each column (except the first column) are pairwise compared with the algorithms in the first column.

Table 3 shows the pairwise Wilcoxon rank-sum test (\(+/\approx /-\)) results among all test algorithms on the test problems. The sum of the test outcomes (sum of \(+/\approx /-\)) confirms that Alg. V performs best, as it is significantly better in 23 pairwise algorithm-problem comparisons. Besides, most of the GP2 methods (Alg. II – V) perform at least comparably to or significantly better than Alg. I.

6 Conclusion and Future Work

This paper utilizes two different techniques to quantify the prediction uncertainty of GP-based symbolic regression, namely, a bootstrapping technique in GP-based symbolic regression and Kriging-based estimation of the residual uncertainty in GP2. The two surrogate modeling techniques, GP-based symbolic regression and GP2 with different configurations, are compared with Kriging and random forest on five state-of-the-art MOP benchmarks. The statistical results show the effectiveness of GP2 w.r.t. mean \(\varDelta \)HV convergence. Among the five GP2 variants, we recommend the variants that use \( \textbf{X} \) or \(\{ \textbf{X} \cup \mathbf {\hat{Y}} \}\) to predict the residuals and R2 as the fitness function in genetic programming, because of their good statistical performance.

For future work, it is recommended to compare the performance of bootstrapping and the jackknife for estimating the variance of GP-based symbolic regression. It is also worthwhile to investigate the performance of the proposed methods on real-world applications.